
ansible-playbook update_pgcluster.yml -e target=system fails #644

Closed
chuegel opened this issue Apr 30, 2024 · 12 comments
Labels
wontfix This will not be worked on
Comments

@chuegel
Contributor

chuegel commented Apr 30, 2024

During our scheduled update of the PostgreSQL cluster, I noticed that the playbook failed:

{
  "redirected": false,
  "url": "http://10.83.200.12:8008/replica",
  "status": 503,
  "server": "BaseHTTP/0.6 Python/3.10.12",
  "date": "Tue, 30 Apr 2024 06:40:38 GMT",
  "content_type": "application/json",
  "elapsed": 0,
  "changed": false,
  "json": {
    "state": "running",
    "postmaster_start_time": "2024-03-05 15:25:17.225621+01:00",
    "role": "master",
    "server_version": 150006,
    "xlog": {
      "location": 191966088696
    },
    "timeline": 15,
    "replication": [
      {
        "usename": "replicator",
        "application_name": "postgresql03",
        "client_addr": "10.83.200.14",
        "state": "streaming",
        "sync_state": "async",
        "sync_priority": 0
      }
    ],
    "dcs_last_seen": 1714459232,
    "database_system_identifier": "7253014758852064969",
    "patroni": {
      "version": "3.2.2",
      "scope": "postgres-cluster",
      "name": "postgresql01"
    }
  },
  "msg": "Status code was 503 and not [200]: HTTP Error 503: Service Unavailable",
  "invocation": {
    "module_args": {
      "url": "http://10.83.200.12:8008/replica",
      "status_code": [
        200
      ],
      "force": false,
      "http_agent": "ansible-httpget",
      "use_proxy": true,
      "validate_certs": true,
      "force_basic_auth": false,
      "use_gssapi": false,
      "body_format": "raw",
      "method": "GET",
      "return_content": false,
      "follow_redirects": "safe",
      "timeout": 30,
      "headers": {},
      "remote_src": false,
      "unredirected_headers": [],
      "decompress": true,
      "use_netrc": true,
      "unsafe_writes": false,
      "url_username": null,
      "url_password": null,
      "client_cert": null,
      "client_key": null,
      "dest": null,
      "body": null,
      "src": null,
      "creates": null,
      "removes": null,
      "unix_socket": null,
      "ca_path": null,
      "ciphers": null,
      "mode": null,
      "owner": null,
      "group": null,
      "seuser": null,
      "serole": null,
      "selevel": null,
      "setype": null,
      "attributes": null
    }
  },
  "_ansible_no_log": false,
  "attempts": 300
}
{
  "cmd": "patronictl -c /etc/patroni/patroni.yml switchover postgres-cluster",
  "stdout": "Current cluster topology\r\n+ Cluster: postgres-cluster (7253014758852064969) ------------+----+-----------+\r\n| Member       | Host         | Role    | State               | TL | Lag in MB |\r\n+--------------+--------------+---------+---------------------+----+-----------+\r\n| postgresql01 | 10.83.200.12 | Leader  | running             | 15 |           |\r\n| postgresql02 | 10.83.200.13 | Replica | in archive recovery | 15 |     20497 |\r\n| postgresql03 | 10.83.200.14 | Replica | streaming           | 15 |         0 |\r\n+--------------+--------------+---------+---------------------+----+-----------+\r\nPrimary [postgresql01]: Candidate ['postgresql02', 'postgresql03'] []: When should the switchover take place (e.g. 2024-04-30T09:28 )  [now]: Are you sure you want to switchover cluster postgres-cluster, demoting current leader postgresql01? [y/N]: Switchover failed, details: 503, Switchover failed",
  "rc": 0,
  "start": "2024-04-30 08:28:09.489857",
  "end": "2024-04-30 08:28:13.165841",
  "delta": "0:00:03.675984",
  "changed": true,
  "invocation": {
    "module_args": {
      "command": "patronictl -c /etc/patroni/patroni.yml switchover postgres-cluster",
      "responses": {
        "(.*)Primary(.*)": "postgresql01",
        "(.*)Candidate(.*)": "postgresql02",
        "(.*)When should the switchover take place(.*)": "now",
        "(.*)Are you sure you want to switchover cluster(.*)": "y"
      },
      "timeout": 30,
      "echo": false,
      "chdir": null,
      "creates": null,
      "removes": null
    }
  },
  "stdout_lines": [
    "Current cluster topology",
    "+ Cluster: postgres-cluster (7253014758852064969) ------------+----+-----------+",
    "| Member       | Host         | Role    | State               | TL | Lag in MB |",
    "+--------------+--------------+---------+---------------------+----+-----------+",
    "| postgresql01 | 10.83.200.12 | Leader  | running             | 15 |           |",
    "| postgresql02 | 10.83.200.13 | Replica | in archive recovery | 15 |     20497 |",
    "| postgresql03 | 10.83.200.14 | Replica | streaming           | 15 |         0 |",
    "+--------------+--------------+---------+---------------------+----+-----------+",
    "Primary [postgresql01]: Candidate ['postgresql02', 'postgresql03'] []: When should the switchover take place (e.g. 2024-04-30T09:28 )  [now]: Are you sure you want to switchover cluster postgres-cluster, demoting current leader postgresql01? [y/N]: Switchover failed, details: 503, Switchover failed"
  ],
  "_ansible_no_log": false
}

However, the upgrade of the two replica nodes went well.

Any hints on how to recover the cluster to a healthy state?
Thanks
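
For reference, the check the playbook keeps retrying here can be reproduced by hand against the Patroni REST API (URL taken from the output above). A 503 from /replica is expected as long as the node is still the leader; that endpoint only returns 200 on a running replica:

curl -s -o /dev/null -w '%{http_code}\n' http://10.83.200.12:8008/replica
curl -s http://10.83.200.12:8008/patroni

The second call returns the same status JSON shown above.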

@vitabaks
Owner

Please attach the Ansible log.

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

Thanks for your reply.
Digging deeper into the logs, it turned out that the pgbackrest version on the backup server was lagging behind:

2024-04-30 00:00:06.440 P00   INFO: archive-get command end: aborted with exception [103]
2024-04-30 00:00:06 CEST [442637-1]  LOG:  started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 00:00:06 CEST [442637-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 00:00:06.490 P00   INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=442639-759ac39e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
WARN: repo1: [ProtocolError] expected value '2.51' for greeting key 'version' but got '2.50'
      HINT: is the same version of pgBackRest installed on the local and remote host?
ERROR: [103]: unable to find a valid repository

After upgrading pgbackrest on the backup server, I now get this error:

2024-04-30 09:25:12 CEST [29612-1]  LOG:  started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 09:25:12 CEST [29612-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 09:25:12.412 P00   INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=29614-ee35ac41 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
2024-04-30 09:25:12.670 P00   INFO: unable to find 00000010.history in the archive
2024-04-30 09:25:12.771 P00   INFO: archive-get command end: completed successfully (363ms)
2024-04-30 09:25:12 CEST [859-806]  LOG:  waiting for WAL to become available at 27/B0002000
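
As a quick sanity check, the versions can be compared directly on the database node and the repository host (repo IP and user taken from the archive-get command line above):

pgbackrest version
ssh postgres@10.83.43.119 pgbackrest version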

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

The other replica seems to be fine:

2024-04-30 09:57:30 CEST [700-23]  LOG:  recovery restart point at 2C/B4010EA8
2024-04-30 09:57:30 CEST [700-24]  DETAIL:  Last completed transaction was at log time 2024-04-30 09:57:22.977464+02.
2024-04-30 10:12:28 CEST [700-25]  LOG:  restartpoint starting: time
2024-04-30 10:12:30 CEST [700-26]  LOG:  restartpoint complete: wrote 21 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=2.018 s, sync=0.003 s, total=2.037 s; sync files=16, longest=0.002 s, average=0.001 s; distance=46 kB, estimate=14701 kB

@vitabaks
Owner

It is strange that the pgbackrest package has not been updated with target=system

@vitabaks
Owner

Now I understand: you are using a dedicated pgbackrest server, and that is where the old package is installed. So yes, it's worth updating the pgbackrest server first.

P.S. I switched to MinIO (S3) and no longer have similar problems with pgbackrest versions.

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

Now I understand: you are using a dedicated pgbackrest server, and that is where the old package is installed. So yes, it's worth updating the pgbackrest server first.

P.S. I switched to MinIO (S3) and no longer have similar problems with pgbackrest versions.

Yes, I use a dedicated pgbackrest server. After aligning the versions, one replica still complains with:

2024-04-30 10:29:30 CEST [56835-1]  LOG:  started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 10:29:30 CEST [56835-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 10:29:30.877 P00   INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=56837-4247d19a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
2024-04-30 10:29:31.129 P00   INFO: unable to find 00000010.history in the archive

But there is no 00000010.history on the pgbackrest server:

ls -la /var/lib/pgbackrest/archive/postgres-cluster/15-1/
total 192
drwxr-x--- 7 postgres postgres  4096 Apr 27 00:01 .
drwxr-x--- 3 postgres postgres  4096 Apr 30 00:01 ..
-rw-r----- 1 postgres postgres   610 Mar  5 15:35 0000000F.history
drwxr-x--- 2 postgres postgres 36864 Apr 27 00:01 0000000F00000028
drwxr-x--- 2 postgres postgres 32768 Apr 16 07:01 0000000F00000029
drwxr-x--- 2 postgres postgres 36864 Apr 21 12:31 0000000F0000002A
drwxr-x--- 2 postgres postgres 36864 Apr 26 17:31 0000000F0000002B
drwxr-x--- 2 postgres postgres 24576 Apr 30 09:01 0000000F0000002C
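
What the repository actually holds for the stanza can also be checked with pgbackrest info (stanza name taken from the logs above), run on the repo host or on any node configured for it:

pgbackrest --stanza=postgres-cluster info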

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

It is strange that the pgbackrest package has not been updated with target=system

That's because the playbook didn't run against the pgbackrest host. Not sure why.

@vitabaks
Owner

The playbook is designed to update the postgres cluster, not the backup server.

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

I understand. The pgbackrest package was updated successfully on the replicas. On the leader, the playbook performs a switchover before upgrading packages; since the switchover failed, the leader also still had an older version of pgbackrest.
I manually upgraded the pgbackrest package on the leader and on the pgbackrest server:
apt install --only-upgrade pgbackrest

I'm not quite sure which steps to take to recover the remaining replica.
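
For the record, the same manual upgrade could be pushed to every host with an ad-hoc Ansible call; a sketch, assuming a Debian-based inventory that also contains the pgbackrest server:

ansible all -i inventory -b -m ansible.builtin.apt -a "name=pgbackrest state=latest only_upgrade=true update_cache=true"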

@vitabaks
Owner

vitabaks commented Apr 30, 2024

Try reinitializing the problem replica:
patronictl reinit postgres-cluster <problem replica name>

After you have fixed the problem with the different versions of the pgbackrest packages between the database servers and the backup servers, try running the update_pgcluster.yml playbook again to complete the cluster update.
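
For example, with the topology shown earlier, where postgresql02 is the member stuck in archive recovery, that would be (illustrative; substitute the actual problem replica name):

patronictl -c /etc/patroni/patroni.yml reinit postgres-cluster postgresql02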

@vitabaks
Owner

vitabaks commented Apr 30, 2024

I will also add an automated update of the pgbackrest package on the backup server.

UPD: #648

@vitabaks added the wontfix label Apr 30, 2024
@chuegel
Contributor Author

chuegel commented Apr 30, 2024

Try reinitializing the problem replica: patronictl reinit postgres-cluster <problem replica name>

After you have fixed the problem with the different versions of the pgbackrest packages between the database servers and the backup servers, try running the update_pgcluster.yml playbook again to complete the cluster update.

That worked!!!

patronictl list postgres-cluster
+ Cluster: postgres-cluster (7253014758852064969) --+----+-----------+
| Member       | Host         | Role    | State     | TL | Lag in MB |
+--------------+--------------+---------+-----------+----+-----------+
| postgresql01 | 10.83.200.12 | Leader  | running   | 15 |           |
| postgresql02 | 10.83.200.13 | Replica | streaming | 15 |         0 |
| postgresql03 | 10.83.200.14 | Replica | streaming | 15 |         0 |
+--------------+--------------+---------+-----------+----+-----------+

Thank you, Sir!
