Point In Time Recovery #551
base: main
Conversation
# Conflicts: # src/upgrade.py
```python
if not success:
    logger.error(f"Restore failed: {error_message}")
    event.fail(error_message)

if recoverable:
    self._clean_data_dir_and_start_mysqld()
else:
    self.charm.app_peer_data.update({
        "s3-block-message": MOVE_RESTORED_CLUSTER_TO_ANOTHER_S3_REPOSITORY_ERROR,
        "binlogs-collecting": "",
    })
```
question: if the restore fails and is not recoverable, does this instruct the user to change the s3 repository? If so, is that the safest & most relevant action for the user to take?
I think it's the most convenient way to report a restore error with an event fail and a blocked unit status simultaneously, while keeping `s3-block-message` as the application status.

It's important to mention that while the unit is blocked, the application status can't transition to the `s3-block-message` due to `MySQLOperatorCharm._on_update_status` -> `if not self.cluster_initialized` -> `if self._mysql.cluster_metadata_exists(self.get_unit_address(unit))`, which will fail like:

```
unit-mysql-0: 01:50:14 WARNING unit.mysql/0.juju-log Failed to check if cluster metadata exists from_instance='10.88.216.67'
unit-mysql-0: 01:50:14 INFO unit.mysql/0.juju-log skip status update when not initialized
```

Summarizing: if this particular check fails, the application status will be as follows:
Also, right below your code reference there is the following code block:

mysql-operator/lib/charms/mysql/v0/backups.py, lines 578 to 581 in c730506:

```python
self.charm.app_peer_data.update({
    "s3-block-message": MOVE_RESTORED_CLUSTER_TO_ANOTHER_S3_REPOSITORY_ERROR,
    "binlogs-collecting": "",
})
```

^ it applies the same logic I'm talking about if any of the later restore steps fail, such as `_clean_data_dir_and_start_mysqld`, `_pitr_restore` or `_post_restore`.
I don't think I understand. To clarify, my question was referring to lines 569-571:

> while keeping `s3-block-message` as application status

If the restore fails and is not recoverable, what should the user do?

> while the unit is blocked, application status can't transit to the `s3-block-message`

What happens if the user removes that unit?
> I don't think I understand. To clarify, my question was referring to lines 569-571

Yes, I'm also referring to lines 569-571. I'm just pointing at lines 578-581 to note that the same logic (as when `_restore` fails and is not recoverable) will be applied to the later steps like `_pitr_restore` and `_post_restore`, assuming these operations are not recoverable.

> if the restore fails not recoverable, what should the user do?

When it's not recoverable, the user will receive a failed-event notification and see a single blocked unit. You can see this on the screenshot I've attached to the previous message. The best course for the user in this situation would be:

1. Investigate the underlying issue in the debug-log and resolve it (for example, the user can re-create the cluster; see the documentation section "Migrate a cluster").
2. Run the same restore again / run another restore.

Maybe it would be a good idea to put a notice about possible restore failures directly in the documentation.
> what happens if the user removes that unit?

- Before restore: unit = `active / idle`, application = `active`
- Restore failed: unit = `blocked / idle`, application = `active`
- User removes unit: application = `active`
- User adds unit: unit = `active / idle`, application = `S3 repository claimed by another cluster`
- Restore succeeds: unit = `active / idle`, application = `Move restored cluster to another S3 repository`
The message is misleading. The user can remove and re-relate to the same s3 integrator/credentials.
src/mysql_vm_helpers.py (Outdated)

```python
if is_running and (
    not self.charm.unit.is_leader() or "binlogs-collecting" not in self.charm.app_peer_data
):
```
Will there be any issues if multiple units are running the service at once?

If so, as the code is currently written, I think there might be cases where a leader switchover could cause multiple units to run the service at once. If the peer relation data changes during a switchover, there may be a delay of a few Juju events (seconds to minutes, depending on deferred events) during which two units are running the service; and if the peer relation data doesn't change during a leader switchover, two units could be running the service for an extended period of time.
There are no problems with several units running this service simultaneously - it's safe
Since we only collect binlogs from the leader unit, we need to add handling for when Juju elects a new leader. What should happen here? Should the new leader unit start collecting binlogs instead?

Furthermore, while thinking of the above use case, we should also handle the scaling scenario: what if the leader unit is scaled down?

Also, I would really prefer it if we could add an integration test for the above scenario (where the leader unit is scaled down, after which the PITR is performed) once we determine how to handle it.
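One possible shape for the switchover handling discussed above, as a minimal sketch. The `BinlogsCollector` helper and `reconcile_binlogs_service` function are hypothetical names for illustration, not the charm's actual API; the idea is simply that every relevant hook (leader-elected, leader-settings-changed, update-status) re-evaluates whether this unit should run the service, so a demoted leader stops it and a newly elected leader starts it:

```python
# Hypothetical sketch: not the charm's real API. Models the idea that
# the binlogs-collecting service runs only on the current leader and
# is reconciled on every relevant Juju hook.
class BinlogsCollector:
    """Stand-in for the systemd/snap service managing binlog uploads."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


def reconcile_binlogs_service(
    collector: BinlogsCollector, is_leader: bool, collecting_enabled: bool
) -> None:
    """Converge the service to the desired state for this unit.

    Called from leader-elected, leader-settings-changed and
    update-status hooks, so a demoted leader stops collecting and a
    newly elected leader picks it up (bounded by the hook cadence).
    """
    if is_leader and collecting_enabled:
        collector.start()
    else:
        collector.stop()
```

Reconciling on update-status would also bound the window in which two units run the service after a switchover, though a brief overlap remains possible between hooks.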
Left some comments and I'll try to test it.
@@ -489,7 +528,12 @@ def _on_restore(self, event: ActionEvent) -> None:

```python
        return

backup_id = event.params["backup-id"].strip().strip("/")
logger.info(f"A restore with backup-id {backup_id} has been requested on unit")
restore_to_time = event.params.get("restore-to-time")
```
Please include validation (through regex?) for the `restore_to_time` input. Otherwise, a badly formatted input will only be caught when running the restore.
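A minimal sketch of what such validation could look like, combining a regex with a `datetime.strptime` check so both shape errors (e.g. missing seconds) and impossible dates are rejected up front. The helper name `is_valid_restore_to_time` is illustrative, not the charm's actual code:

```python
import re
from datetime import datetime

# Hypothetical validation helper; the function name and placement are
# illustrative, not the charm's actual implementation.
MYSQL_TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")


def is_valid_restore_to_time(value: str) -> bool:
    """Accept the keyword 'latest' or a full MySQL timestamp (seconds included)."""
    if value == "latest":
        return True
    # Regex rejects truncated inputs such as "2025-01-08 20:50".
    if not MYSQL_TIMESTAMP_RE.match(value):
        return False
    try:
        # strptime rejects well-shaped but impossible dates, e.g. month 13.
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False
```

With a check like this, `_on_restore` could call `event.fail(...)` immediately on a bad parameter instead of failing mid-restore.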
Just to reinforce this: failing to use the correct format will break recovery, e.g. by not setting seconds, `restore-to-time="2025-01-08 20:50"`, as I just did.
@@ -55,6 +55,9 @@ restore:

```yaml
  backup-id:
    type: string
    description: A backup-id to identify the backup to restore (format = %Y-%m-%dT%H:%M:%SZ)
  restore-to-time:
    type: string
    description: Point-in-time-recovery target in MySQL timestamp format.
```
For consistency's sake, include the format, as done for `backup-id`.
Important!

This PR relies on canonical/charmed-mysql-snap#56.

Overview

MySQL stores binary transaction logs. This PR adds a service job to upload these logs to the S3 bucket, and the ability to use them later for point-in-time recovery with a new `restore-to-time` parameter during restore. This new parameter accepts a MySQL timestamp or the keyword `latest` (to replay all the transaction logs).

Also, a new application blocked status is introduced, `Another cluster S3 repository`, to signal to the user that the configured S3 repository is claimed by another cluster; while it is shown, the binlogs-collecting job is disabled and creating new backups is restricted (these are the only workload limitations). This is crucial to keep the stored binary logs safe from other clusters. The check uses `@@GLOBAL.group_replication_group_name`.

After a restore, the cluster's group replication is reinitialized, so it practically becomes a new, different cluster. In this case, the `Another cluster S3 repository` message is changed to `Move restored cluster to another S3 repository` to indicate this event more clearly to the user.

Both block messages will disappear when the S3 configuration is removed or changed to an empty repository.
Usage example

```shell
juju run mysql/leader restore backup-id=2024-11-20T17:08:24Z restore-to-time="2024-11-20 17:10:01"
juju run mysql/leader restore backup-id=2024-11-20T17:08:24Z restore-to-time="latest"
```
Key notes