
ansible-playbook update_pgcluster.yml -e target=system fails #644

Closed
chuegel opened this issue Apr 30, 2024 · 12 comments
Labels
wontfix This will not be worked on
Comments

@chuegel
Contributor

chuegel commented Apr 30, 2024

During our scheduled update of the PostgreSQL cluster, I noticed that the playbook failed:

{
  "redirected": false,
  "url": "http://10.83.200.12:8008/replica",
  "status": 503,
  "server": "BaseHTTP/0.6 Python/3.10.12",
  "date": "Tue, 30 Apr 2024 06:40:38 GMT",
  "content_type": "application/json",
  "elapsed": 0,
  "changed": false,
  "json": {
    "state": "running",
    "postmaster_start_time": "2024-03-05 15:25:17.225621+01:00",
    "role": "master",
    "server_version": 150006,
    "xlog": {
      "location": 191966088696
    },
    "timeline": 15,
    "replication": [
      {
        "usename": "replicator",
        "application_name": "postgresql03",
        "client_addr": "10.83.200.14",
        "state": "streaming",
        "sync_state": "async",
        "sync_priority": 0
      }
    ],
    "dcs_last_seen": 1714459232,
    "database_system_identifier": "7253014758852064969",
    "patroni": {
      "version": "3.2.2",
      "scope": "postgres-cluster",
      "name": "postgresql01"
    }
  },
  "msg": "Status code was 503 and not [200]: HTTP Error 503: Service Unavailable",
  "invocation": {
    "module_args": {
      "url": "http://10.83.200.12:8008/replica",
      "status_code": [
        200
      ],
      "force": false,
      "http_agent": "ansible-httpget",
      "use_proxy": true,
      "validate_certs": true,
      "force_basic_auth": false,
      "use_gssapi": false,
      "body_format": "raw",
      "method": "GET",
      "return_content": false,
      "follow_redirects": "safe",
      "timeout": 30,
      "headers": {},
      "remote_src": false,
      "unredirected_headers": [],
      "decompress": true,
      "use_netrc": true,
      "unsafe_writes": false,
      "url_username": null,
      "url_password": null,
      "client_cert": null,
      "client_key": null,
      "dest": null,
      "body": null,
      "src": null,
      "creates": null,
      "removes": null,
      "unix_socket": null,
      "ca_path": null,
      "ciphers": null,
      "mode": null,
      "owner": null,
      "group": null,
      "seuser": null,
      "serole": null,
      "selevel": null,
      "setype": null,
      "attributes": null
    }
  },
  "_ansible_no_log": false,
  "attempts": 300
}
{
  "cmd": "patronictl -c /etc/patroni/patroni.yml switchover postgres-cluster",
  "stdout": "Current cluster topology\r\n+ Cluster: postgres-cluster (7253014758852064969) ------------+----+-----------+\r\n| Member       | Host         | Role    | State               | TL | Lag in MB |\r\n+--------------+--------------+---------+---------------------+----+-----------+\r\n| postgresql01 | 10.83.200.12 | Leader  | running             | 15 |           |\r\n| postgresql02 | 10.83.200.13 | Replica | in archive recovery | 15 |     20497 |\r\n| postgresql03 | 10.83.200.14 | Replica | streaming           | 15 |         0 |\r\n+--------------+--------------+---------+---------------------+----+-----------+\r\nPrimary [postgresql01]: Candidate ['postgresql02', 'postgresql03'] []: When should the switchover take place (e.g. 2024-04-30T09:28 )  [now]: Are you sure you want to switchover cluster postgres-cluster, demoting current leader postgresql01? [y/N]: Switchover failed, details: 503, Switchover failed",
  "rc": 0,
  "start": "2024-04-30 08:28:09.489857",
  "end": "2024-04-30 08:28:13.165841",
  "delta": "0:00:03.675984",
  "changed": true,
  "invocation": {
    "module_args": {
      "command": "patronictl -c /etc/patroni/patroni.yml switchover postgres-cluster",
      "responses": {
        "(.*)Primary(.*)": "postgresql01",
        "(.*)Candidate(.*)": "postgresql02",
        "(.*)When should the switchover take place(.*)": "now",
        "(.*)Are you sure you want to switchover cluster(.*)": "y"
      },
      "timeout": 30,
      "echo": false,
      "chdir": null,
      "creates": null,
      "removes": null
    }
  },
  "stdout_lines": [
    "Current cluster topology",
    "+ Cluster: postgres-cluster (7253014758852064969) ------------+----+-----------+",
    "| Member       | Host         | Role    | State               | TL | Lag in MB |",
    "+--------------+--------------+---------+---------------------+----+-----------+",
    "| postgresql01 | 10.83.200.12 | Leader  | running             | 15 |           |",
    "| postgresql02 | 10.83.200.13 | Replica | in archive recovery | 15 |     20497 |",
    "| postgresql03 | 10.83.200.14 | Replica | streaming           | 15 |         0 |",
    "+--------------+--------------+---------+---------------------+----+-----------+",
    "Primary [postgresql01]: Candidate ['postgresql02', 'postgresql03'] []: When should the switchover take place (e.g. 2024-04-30T09:28 )  [now]: Are you sure you want to switchover cluster postgres-cluster, demoting current leader postgresql01? [y/N]: Switchover failed, details: 503, Switchover failed"
  ],
  "_ansible_no_log": false
}

However, the upgrade of the two replica nodes went well.

Any hints on how to recover the cluster to a healthy state?
Thanks
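
For reference, the check the playbook keeps retrying here can be reproduced by hand against the Patroni REST API (URL taken from the output above). A 503 from /replica is expected as long as the node is still the leader; that endpoint only returns 200 on a running replica:

curl -s -o /dev/null -w '%{http_code}\n' http://10.83.200.12:8008/replica
curl -s http://10.83.200.12:8008/patroni

The second call returns the same status JSON shown above.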

@vitabaks
Owner

Please attach the Ansible log.

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

Thanks for your reply.
Digging deeper into the logs, it turned out that the pgbackrest version on the backup server was lagging behind:

2024-04-30 00:00:06.440 P00   INFO: archive-get command end: aborted with exception [103]
2024-04-30 00:00:06 CEST [442637-1]  LOG:  started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 00:00:06 CEST [442637-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 00:00:06.490 P00   INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=442639-759ac39e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
WARN: repo1: [ProtocolError] expected value '2.51' for greeting key 'version' but got '2.50'
      HINT: is the same version of pgBackRest installed on the local and remote host?
ERROR: [103]: unable to find a valid repository

After upgrading pgbackrest on the backup server, I now get this error:

2024-04-30 09:25:12 CEST [29612-1]  LOG:  started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 09:25:12 CEST [29612-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 09:25:12.412 P00   INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=29614-ee35ac41 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
2024-04-30 09:25:12.670 P00   INFO: unable to find 00000010.history in the archive
2024-04-30 09:25:12.771 P00   INFO: archive-get command end: completed successfully (363ms)
2024-04-30 09:25:12 CEST [859-806]  LOG:  waiting for WAL to become available at 27/B0002000
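
As a quick sanity check, the versions can be compared directly on the database node and the repository host (repo IP and user taken from the archive-get command line above):

pgbackrest version
ssh postgres@10.83.43.119 pgbackrest version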

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

The other replica seems to be fine:

2024-04-30 09:57:30 CEST [700-23]  LOG:  recovery restart point at 2C/B4010EA8
2024-04-30 09:57:30 CEST [700-24]  DETAIL:  Last completed transaction was at log time 2024-04-30 09:57:22.977464+02.
2024-04-30 10:12:28 CEST [700-25]  LOG:  restartpoint starting: time
2024-04-30 10:12:30 CEST [700-26]  LOG:  restartpoint complete: wrote 21 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=2.018 s, sync=0.003 s, total=2.037 s; sync files=16, longest=0.002 s, average=0.001 s; distance=46 kB, estimate=14701 kB

@vitabaks
Owner

It is strange that the pgbackrest package has not been updated with target=system

@vitabaks
Owner

Now I understand: you are using a dedicated pgbackrest server, and that is where the old package is installed. So yes, it's worth updating the pgbackrest server first.

P.S. I switched to MinIO (S3) and no longer have similar problems with pgbackrest versions.

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

Now I understand: you are using a dedicated pgbackrest server, and that is where the old package is installed. So yes, it's worth updating the pgbackrest server first.

P.S. I switched to MinIO (S3) and no longer have similar problems with pgbackrest versions.

Yes, I use a dedicated pgbackrest server. After aligning the versions, one replica still complains with:

2024-04-30 10:29:30 CEST [56835-1]  LOG:  started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 10:29:30 CEST [56835-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 10:29:30.877 P00   INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=56837-4247d19a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
2024-04-30 10:29:31.129 P00   INFO: unable to find 00000010.history in the archive

But there is no 00000010.history on the pgbackrest server:

ls -la /var/lib/pgbackrest/archive/postgres-cluster/15-1/
total 192
drwxr-x--- 7 postgres postgres  4096 Apr 27 00:01 .
drwxr-x--- 3 postgres postgres  4096 Apr 30 00:01 ..
-rw-r----- 1 postgres postgres   610 Mar  5 15:35 0000000F.history
drwxr-x--- 2 postgres postgres 36864 Apr 27 00:01 0000000F00000028
drwxr-x--- 2 postgres postgres 32768 Apr 16 07:01 0000000F00000029
drwxr-x--- 2 postgres postgres 36864 Apr 21 12:31 0000000F0000002A
drwxr-x--- 2 postgres postgres 36864 Apr 26 17:31 0000000F0000002B
drwxr-x--- 2 postgres postgres 24576 Apr 30 09:01 0000000F0000002C
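
What the repository actually holds for the stanza can also be checked with pgbackrest info (stanza name taken from the logs above), run on the repo host or on any node configured for it:

pgbackrest --stanza=postgres-cluster info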

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

It is strange that the pgbackrest package has not been updated with target=system

That's because the playbook didn't run against the pgbackrest host. Not sure why.

@vitabaks
Owner

The playbook is designed to update the postgres cluster, not the backup server.

@chuegel
Contributor Author

chuegel commented Apr 30, 2024

I understand. The pgbackrest package was updated successfully on the replicas. On the leader, the playbook performs a switchover before upgrading packages; since the switchover failed, the leader also still had an older version of pgbackrest.
I manually upgraded the pgbackrest package on the leader and on the pgbackrest server:
apt install --only-upgrade pgbackrest

I'm not quite sure which steps to take to recover the remaining replica.
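
For the record, the same manual upgrade could be pushed to every host with an ad-hoc Ansible call; a sketch, assuming a Debian-based inventory that also contains the pgbackrest server:

ansible all -i inventory -b -m ansible.builtin.apt -a "name=pgbackrest state=latest only_upgrade=true update_cache=true"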

@vitabaks
Owner

vitabaks commented Apr 30, 2024

Try reinitializing the problem replica:
patronictl reinit postgres-cluster <problem replica name>

After you have fixed the problem with the different versions of the pgbackrest packages between the database servers and the backup servers, try running the update_pgcluster.yml playbook again to complete the cluster update.
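
For example, with the topology shown earlier, where postgresql02 is the member stuck in archive recovery, that would be (illustrative; substitute the actual problem replica name):

patronictl -c /etc/patroni/patroni.yml reinit postgres-cluster postgresql02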

@vitabaks
Owner

vitabaks commented Apr 30, 2024

I will also add an automated update of the pgbackrest package on the backup server.

UPD: #648

@vitabaks added the wontfix label Apr 30, 2024
@chuegel
Contributor Author

chuegel commented Apr 30, 2024

Try reinitializing the problem replica: patronictl reinit postgres-cluster <problem replica name>

After you have fixed the problem with the different versions of the pgbackrest packages between the database servers and the backup servers, try running the update_pgcluster.yml playbook again to complete the cluster update.

That worked!!!

patronictl list postgres-cluster
+ Cluster: postgres-cluster (7253014758852064969) --+----+-----------+
| Member       | Host         | Role    | State     | TL | Lag in MB |
+--------------+--------------+---------+-----------+----+-----------+
| postgresql01 | 10.83.200.12 | Leader  | running   | 15 |           |
| postgresql02 | 10.83.200.13 | Replica | streaming | 15 |         0 |
| postgresql03 | 10.83.200.14 | Replica | streaming | 15 |         0 |
+--------------+--------------+---------+-----------+----+-----------+

Thank you, Sir!
