Add playbook: update_pgcluster (#281)
vitabaks authored Mar 18, 2023
1 parent 446f6de commit c843528
Showing 14 changed files with 873 additions and 1 deletion.
24 changes: 23 additions & 1 deletion README.md
@@ -44,6 +44,8 @@ In addition to deploying new clusters, this playbook also support the deployment
- [Create cluster with WAL-G:](#create-cluster-with-wal-g)
- [Point-In-Time-Recovery:](#point-in-time-recovery)
- [Maintenance](#maintenance)
- [Update the PostgreSQL HA Cluster](#update-the-postgresql-ha-cluster)
- [Using Git for cluster configuration management](#using-git-for-cluster-configuration-management-iacgitops)
- [Disaster Recovery](#disaster-recovery)
- [etcd](#etcd)
- [PostgreSQL (databases)](#postgresql-databases)
@@ -460,7 +462,27 @@ I recommend that you study the following materials for further maintenance of th
- [Patroni documentation](https://patroni.readthedocs.io/en/latest/)
- [etcd operations guide](https://etcd.io/docs/v3.3.12/op-guide/)

## Using Git for cluster configuration management (IaC/GitOps)
#### Update the PostgreSQL HA Cluster

The `update_pgcluster.yml` playbook updates the PostgreSQL HA cluster to a new minor version (for example, 15.1->15.2).

Usage:

- Update PostgreSQL:

`ansible-playbook update_pgcluster.yml`

- Update Patroni:

`ansible-playbook update_pgcluster.yml -e target=patroni`

- Update all system packages:

`ansible-playbook update_pgcluster.yml -e target=system`

More details [here](roles/update)

#### Using Git for cluster configuration management (IaC/GitOps)

Infrastructure as Code (IaC) is the managing and provisioning of infrastructure through code instead of through manual processes. \
GitOps automates infrastructure updates using a Git workflow with continuous integration (CI) and continuous delivery (CI/CD). When new code is merged, the CI/CD pipeline enacts the change in the environment. Any configuration drift, such as manual changes or errors, is overwritten by GitOps automation so the environment converges on the desired state defined in Git.
129 changes: 129 additions & 0 deletions roles/update/README.md
@@ -0,0 +1,129 @@
## Update the PostgreSQL HA Cluster

This role updates the PostgreSQL HA cluster to a new minor version (for example, 15.1->15.2).

By default, only the PostgreSQL packages defined in the `postgresql_packages` variable (vars/Debian.yml or vars/RedHat.yml) are updated. In addition, you can update Patroni or the entire system.

#### Usage

Update PostgreSQL:

`ansible-playbook update_pgcluster.yml`

Update Patroni:

`ansible-playbook update_pgcluster.yml -e target=patroni`

Update all system packages:

`ansible-playbook update_pgcluster.yml -e target=system`


#### Variables

- `target`
  - Defines the target for the update.
  - Available values: 'postgres', 'patroni', 'system'
  - Default value: postgres
- `max_replication_lag_bytes`
  - Determines the size of the replication lag above which the update will not be performed.
  - If the lag is high, you will be prompted to try again later.
  - Default value: 10485760 (10 MiB)
- `max_transaction_sec`
  - Determines the maximum transaction time, in the presence of which the update will not be performed.
  - If long-running transactions are present, you will be prompted to try again later.
  - Default value: 15 (seconds)
- `update_extensions`
  - If 'true', an attempt will be made to automatically update all extensions for all databases.
  - Specify 'false', to avoid updating extensions.
  - Default value: true
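
As a minimal sketch, the variables above could be collected in a vars file and passed to the playbook with `-e @update_vars.yml`. The file name `update_vars.yml` is hypothetical and not part of the repository; the values are the documented defaults, except `target` and `update_extensions`, which are changed for illustration.

```yaml
# update_vars.yml (hypothetical file name, used only for this example)
target: patroni                      # one of: postgres, patroni, system
max_replication_lag_bytes: 10485760  # 10 MiB, the documented default
max_transaction_sec: 15              # seconds, the documented default
update_extensions: false             # skip the automatic extension update
```

Usage: `ansible-playbook update_pgcluster.yml -e @update_vars.yml`. A vars file keeps booleans and integers typed, whereas values passed as `-e key=value` reach Ansible as strings.
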
---

## Plan:

Note on the expected database downtime during the update:

When using load balancing for read-only traffic (the "Type A" and "Type C" schemes), zero downtime is expected (for read traffic), provided there is more than one replica in the cluster. For write traffic (to the Primary), the expected downtime is ~5-10 seconds.

#### 1. PRE-UPDATE: Perform Pre-Checks
- Test PostgreSQL DB Access
- Make sure that physical replication is active
  - Stop, if there are no active replicas
- Make sure there is no high replication lag
  - Note: no more than `max_replication_lag_bytes`
  - Stop, if replication lag is high
- Make sure there are no long-running transactions
  - no more than `max_transaction_sec`
  - Stop, if long-running transactions detected
#### 2. UPDATE: Secondary (one by one)
- Stop read-only traffic
  - Enable `noloadbalance`, `nosync`, `nofailover` parameters in the patroni.yml
  - Reload Patroni service
  - Make sure replica endpoint is unavailable
  - Wait for active transactions to complete
- Stop Services
  - Execute CHECKPOINT before stopping PostgreSQL
  - Stop Patroni service on the Cluster Replica
- Update PostgreSQL
  - if `target` variable is not defined or `target=postgres`
    - Install the latest version of PostgreSQL packages
- Update Patroni
  - if `target=patroni` (or `system`)
    - Install the latest version of Patroni package
- Update all system packages (includes PostgreSQL and Patroni)
  - if `target=system`
    - Update all system packages
- Start Services
  - Start Patroni service
  - Wait for Patroni port to become open on the host
  - Check that Patroni is healthy
  - Check that PostgreSQL is started and accepting connections
- Start read-only traffic
  - Disable `noloadbalance`, `nosync`, `nofailover` parameters in the patroni.yml
  - Reload Patroni service
  - Make sure replica endpoint is available
- Perform the same steps for the next replica server.
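
The "Stop read-only traffic" step above can be pictured as the following patroni.yml fragment. This is a sketch that assumes the three parameters live in Patroni's `tags` section; the template the role actually edits is not shown in this commit view.

```yaml
# Fragment of patroni.yml on the replica being updated (sketch, not the role's template)
tags:
  noloadbalance: true  # replica endpoint stops serving read-only traffic
  nosync: true         # node is not selected as a synchronous replica
  nofailover: true     # node does not take part in the leader race
```

Setting the three tags back to `false` and reloading Patroni corresponds to the "Start read-only traffic" step.
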
#### 3. UPDATE: Primary
- Switchover Patroni leader role
  - Perform switchover of the leader for the Patroni cluster
  - Make sure that Patroni is healthy and is a replica
  - Notes:
    - At this stage, the leader becomes a replica
    - The database downtime is ~5 seconds (write traffic)
- Stop read-only traffic
  - Enable `noloadbalance`, `nosync`, `nofailover` parameters in the patroni.yml
  - Reload Patroni service
  - Make sure replica endpoint is unavailable
  - Wait for active transactions to complete
- Stop Services
  - Execute CHECKPOINT before stopping PostgreSQL
  - Stop Patroni service on the old Cluster Leader
- Update PostgreSQL
  - if `target` variable is not defined or `target=postgres`
    - Install the latest version of PostgreSQL packages
- Update Patroni
  - if `target=patroni` (or `system`)
    - Install the latest version of Patroni package
- Update all system packages (includes PostgreSQL and Patroni)
  - if `target=system`
    - Update all system packages
- Start Services
  - Start Patroni service
  - Wait for Patroni port to become open on the host
  - Check that Patroni is healthy
  - Check that PostgreSQL is started and accepting connections
- Start read-only traffic
  - Disable `noloadbalance`, `nosync`, `nofailover` parameters in the patroni.yml
  - Reload Patroni service
  - Make sure replica endpoint is available
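
The switchover at the start of this stage could, for example, be triggered through Patroni's REST API. The task below is a sketch and not the role's actual implementation: it assumes the API is reachable without authentication on `patroni_restapi_port`, and `patroni_member_name` is a hypothetical variable holding the current leader's Patroni member name.

```yaml
# Sketch only: intended to run on the current leader host
- name: Perform switchover of the leader for the Patroni cluster
  uri:
    url: http://{{ inventory_hostname }}:{{ patroni_restapi_port }}/switchover
    method: POST
    body_format: json
    body:
      leader: "{{ patroni_member_name }}"  # hypothetical variable
    status_code: 200
  register: patroni_switchover_result
```

Because no `candidate` is given, Patroni chooses the new leader among the healthy replicas, which by this point have already been updated.
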
#### 4. POST-UPDATE: Update extensions
- Update extensions
  - Get the current Patroni Cluster Leader Node
  - Get a list of databases
  - Update extensions in each database
    - Get a list of old PostgreSQL extensions
    - Update old PostgreSQL extensions (if an update is required)
- Check the Patroni cluster state
- Check the current PostgreSQL version
- List the Patroni cluster members
- Update completed.
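
The per-database extension update could look roughly like the tasks below. They are a sketch under assumptions: the role's own `update_extensions.yml` is not shown in this view, the register name `pg_old_extensions` is hypothetical, and the tasks are assumed to run on the leader as a user that can connect with `psql`. `pg_target_dbname` is the loop variable set in `extensions.yml`.

```yaml
# Hypothetical tasks; not the contents of the role's update_extensions.yml
- name: Get a list of old PostgreSQL extensions
  command: >-
    psql -d "{{ pg_target_dbname }}" -tAXc "select name
    from pg_catalog.pg_available_extensions
    where installed_version is not null
    and installed_version <> default_version"
  register: pg_old_extensions
  changed_when: false
  when: update_extensions | bool

- name: Update old PostgreSQL extensions
  command: >-
    psql -d "{{ pg_target_dbname }}" -tAXc
    'alter extension "{{ item }}" update'
  loop: "{{ pg_old_extensions.stdout_lines | default([]) }}"
  when: update_extensions | bool
```

The extension name is double-quoted inside the SQL so that names such as `uuid-ossp` remain valid identifiers.
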
23 changes: 23 additions & 0 deletions roles/update/tasks/extensions.yml
@@ -0,0 +1,23 @@
---
- name: 'Get the current Patroni Cluster Leader Node'
  uri:
    url: http://{{ inventory_hostname }}:{{ patroni_restapi_port }}/leader
    status_code: 200
  register: patroni_leader_result
  changed_when: false
  failed_when: false

- name: Get a list of databases
  command: psql -tAXc "select datname from pg_catalog.pg_database where not datistemplate"
  register: databases_list
  changed_when: false
  when:
    - patroni_leader_result.status == 200

- name: Update extensions in each database
  include_tasks: update_extensions.yml
  loop: "{{ databases_list.stdout_lines }}"
  loop_control:
    loop_var: pg_target_dbname
  when: databases_list.stdout_lines is defined
...
73 changes: 73 additions & 0 deletions roles/update/tasks/patroni.yml
@@ -0,0 +1,73 @@
---
# patroni_installation_method: "pip"
- block:
    - name: Install the latest version of Patroni
      pip:
        name: patroni
        state: latest
        executable: pip3
        extra_args: "--trusted-host=pypi.python.org --trusted-host=pypi.org --trusted-host=files.pythonhosted.org"
        umask: "0022"
      environment:
        PATH: "{{ ansible_env.PATH }}:/usr/local/bin:/usr/bin"
      when: installation_method == "repo" and patroni_installation_method == "pip"
  environment: "{{ proxy_env | default({}) }}"
  vars:
    ansible_python_interpreter: /usr/bin/python3

# patroni_installation_method: "rpm/deb"
- block:
    # Debian
    - name: Install the latest version of Patroni packages
      package:
        name: "{{ patroni_packages | default('patroni') }}"
        state: latest
      when: ansible_os_family == "Debian" and patroni_deb_package_repo | length < 1

    # RedHat
    - name: Install the latest version of Patroni packages
      package:
        name: "{{ patroni_packages | default('patroni') }}"
        state: latest
      when: ansible_os_family == "RedHat" and patroni_rpm_package_repo | length < 1

    # when patroni_deb_package_repo or patroni_rpm_package_repo URL is defined
    # Debian
    - name: Download Patroni deb package
      get_url:
        url: "{{ item }}"
        dest: /tmp/
        timeout: 60
        validate_certs: false
      loop: "{{ patroni_deb_package_repo | list }}"
      when: ansible_os_family == "Debian" and patroni_deb_package_repo | length > 0

    - name: Install Patroni from deb package
      apt:
        force_apt_get: true
        deb: "/tmp/{{ item }}"
        state: present
      loop: "{{ patroni_deb_package_repo | map('basename') | list }}"
      when: ansible_os_family == "Debian" and patroni_deb_package_repo | length > 0

    # RedHat
    - name: Download Patroni rpm package
      get_url:
        url: "{{ item }}"
        dest: /tmp/
        timeout: 60
        validate_certs: false
      loop: "{{ patroni_rpm_package_repo | list }}"
      when: ansible_os_family == "RedHat" and patroni_rpm_package_repo | length > 0

    - name: Install Patroni from rpm package
      package:
        name: "/tmp/{{ item }}"
        state: present
      loop: "{{ patroni_rpm_package_repo | map('basename') | list }}"
      when: ansible_os_family == "RedHat" and patroni_rpm_package_repo | length > 0
  environment: "{{ proxy_env | default({}) }}"
  when:
    - installation_method == "repo"
    - (patroni_installation_method == "rpm" or patroni_installation_method == "deb")
...
25 changes: 25 additions & 0 deletions roles/update/tasks/postgres.yml
@@ -0,0 +1,25 @@
---
- name: Clean yum cache
  command: yum clean all
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == '7'

- name: Clean dnf cache
  command: dnf clean all
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version is version('8', '>=')

- name: Update apt cache
  apt:
    update_cache: true
    cache_valid_time: 3600
  when: ansible_os_family == "Debian"

- name: Install the latest version of PostgreSQL packages
  package:
    name: "{{ item }}"
    state: latest
  loop: "{{ postgresql_packages }}"
...
81 changes: 81 additions & 0 deletions roles/update/tasks/pre_checks.yml
@@ -0,0 +1,81 @@
---
- name: '[Pre-Check] (ALL) Test PostgreSQL DB Access'
  command: psql -tAXc 'select 1'
  changed_when: false

- name: '[Pre-Check] Make sure that physical replication is active'
  command: >-
    psql -tAXc "select count(*) from pg_stat_replication
    where application_name != 'pg_basebackup'"
  register: pg_replication_state
  changed_when: false
  when:
    - inventory_hostname in groups['primary']

# Stop, if there are no active replicas
- name: "Pre-Check error. Print physical replication state"
  fail:
    msg: "There are no active replica servers (pg_stat_replication returned 0 entries)."
  when:
    - inventory_hostname in groups['primary']
    - pg_replication_state.stdout | int == 0

- name: '[Pre-Check] Make sure there is no high replication lag (more than {{ max_replication_lag_bytes | human_readable }})'
  command: >-
    psql -tAXc "select pg_wal_lsn_diff(pg_current_wal_lsn(),
    replay_lsn) pg_lag_bytes from pg_stat_replication
    order by pg_lag_bytes desc limit 1"
  register: pg_lag_bytes
  changed_when: false
  failed_when: false
  until: pg_lag_bytes.stdout|int < max_replication_lag_bytes|int
  retries: 30
  delay: 5
  when:
    - inventory_hostname in groups['primary']

# Stop, if replication lag is high
- block:
    - name: "Print replication lag"
      debug:
        msg: "Current replication lag:
          {{ pg_lag_bytes.stdout | int | human_readable }}"

    - name: "Pre-Check error. Please try again later"
      fail:
        msg: High replication lag on the Patroni Cluster, please try again later.
  when:
    - pg_lag_bytes.stdout is defined
    - pg_lag_bytes.stdout|int >= max_replication_lag_bytes|int

- name: '[Pre-Check] Make sure there are no long-running transactions (more than {{ max_transaction_sec }} seconds)'
  command: >-
    psql -tAXc "select pid, usename, client_addr, clock_timestamp() - xact_start as xact_age,
    state, wait_event_type ||':'|| wait_event as wait_events,
    left(regexp_replace(query, E'[ \\t\\n\\r]+', ' ', 'g'),100) as query
    from pg_stat_activity
    where clock_timestamp() - xact_start > '{{ max_transaction_sec }} seconds'::interval
    and backend_type = 'client backend' and pid <> pg_backend_pid()
    order by xact_age desc limit 10"
  register: pg_long_transactions
  changed_when: false
  failed_when: false
  until: pg_long_transactions.stdout | length < 1
  retries: 30
  delay: 2
  when:
    - inventory_hostname in groups['primary']

# Stop, if long-running transactions detected
- block:
    - name: "Print long-running (>{{ max_transaction_sec }}s) transactions"
      debug:
        msg: "{{ pg_long_transactions.stdout_lines }}"

    - name: "Pre-Check error. Please try again later"
      fail:
        msg: long-running transactions detected (more than {{ max_transaction_sec }} seconds), please try again later.
  when:
    - pg_long_transactions.stdout is defined
    - pg_long_transactions.stdout | length > 0
...