Point In Time Recovery #551
base: main
Conversation
# Conflicts: # src/upgrade.py
```python
if not success:
    logger.error(f"Restore failed: {error_message}")
    event.fail(error_message)

if recoverable:
    self._clean_data_dir_and_start_mysqld()
else:
    self.charm.app_peer_data.update({
        "s3-block-message": MOVE_RESTORED_CLUSTER_TO_ANOTHER_S3_REPOSITORY_ERROR,
        "binlogs-collecting": "",
    })
```
question: if the restore fails and is not recoverable, does this instruct the user to change the s3 repository? If so, is that the safest & most relevant action for the user to take?
I think it's the most convenient way to report a restore error with an event fail and a blocked unit status simultaneously, while keeping `s3-block-message` as the application status.

It's important to mention that while the unit is blocked, the application status can't transition to the `s3-block-message` due to `MySQLOperatorCharm._on_update_status` -> `if not self.cluster_initialized` -> `if self._mysql.cluster_metadata_exists(self.get_unit_address(unit))`, which will fail like:

```
unit-mysql-0: 01:50:14 WARNING unit.mysql/0.juju-log Failed to check if cluster metadata exists from_instance='10.88.216.67'
unit-mysql-0: 01:50:14 INFO unit.mysql/0.juju-log skip status update when not initialized
```

Summarizing: if this particular check fails, the application status will be as follows:
Also, right below your code reference there is the following code block:

mysql-operator/lib/charms/mysql/v0/backups.py, lines 578 to 581 in c730506:

```python
self.charm.app_peer_data.update({
    "s3-block-message": MOVE_RESTORED_CLUSTER_TO_ANOTHER_S3_REPOSITORY_ERROR,
    "binlogs-collecting": "",
})
```

^ it applies the same logic I'm talking about if any of the later restore steps fail, such as `_clean_data_dir_and_start_mysqld`, `_pitr_restore` or `_post_restore`.
I don't think I understand. To clarify, my question was referring to lines 569-571:

> while keeping `s3-block-message` as application status

If the restore fails and is not recoverable, what should the user do?

> while the unit is blocked, application status can't transit to the `s3-block-message`

What happens if the user removes that unit?
> I don't think I understand. To clarify, my question was referring to lines 569-571

Yes, I'm also referring to lines 569-571. I'm just pointing at lines 578-581 to note that the same logic (as when `_restore` fails and is not recoverable) will be applied to the later steps like `_pitr_restore` and `_post_restore`, assuming these operations are not recoverable.

> if the restore fails not recoverable, what should the user do?

When it's not recoverable, the user will receive a failed-event notification and see a single blocked unit. You can see this on the screenshot I've attached to the previous message. The best course for the user in this situation would be:

1. Investigate the underlying issue in the debug-log and resolve it (for example, the user can re-create the cluster; see the documentation section "Migrate a cluster").
2. Run the same restore again / run another restore.

Maybe it would be a good idea to put a notice about possible restore failures directly in the documentation.
> what happens if the user removes that unit?

- Before restore: unit = `active / idle`, application = `active`
- Restore failed: unit = `blocked / idle`, application = `active`
- User removes unit: application = `active`
- User adds unit: unit = `active / idle`, application = `S3 repository claimed by another cluster`
- Restore succeeds: unit = `active / idle`, application = `Move restored cluster to another S3 repository`
The message is misleading. The user can remove and re-relate to the same s3 integrator/credentials.
src/mysql_vm_helpers.py (Outdated)

```python
if is_running and (
    not self.charm.unit.is_leader() or "binlogs-collecting" not in self.charm.app_peer_data
):
```
Will there be any issues if multiple units are running the service at once?

If so, as the code is currently written, I think there might be cases where a leader switchover could cause multiple units to run the service at once. If the peer relation data changes during a switchover, there may be a delay of a few Juju events (seconds to minutes, depending on deferred events) during which two units are running the service; and if the peer relation data doesn't change during a leader switchover, two units could be running the service for an extended period of time.
There are no problems with several units running this service simultaneously - it's safe
Since we only collect binlogs from the leader unit, we need to add handling for when Juju elects a new leader. What should happen here? Should the new leader unit start collecting binlogs instead?

Furthermore, while thinking of the above use case, we should also handle the scaling scenario: what if the leader unit is scaled down?

Also, I would really prefer it if we could add an integration test for the above scenario (where the leader unit is scaled down, after which the PITR is performed) once we determine how to handle it.
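One possible shape for the switchover handling discussed above, as a minimal sketch. The `BinlogsCollector` helper and `reconcile_binlogs_service` function are hypothetical names for illustration, not the charm's actual API; the idea is simply that every relevant hook (leader-elected, leader-settings-changed, update-status) re-evaluates whether this unit should run the service, so a demoted leader stops it and a newly elected leader starts it:

```python
# Hypothetical sketch: not the charm's real API. Models the idea that
# the binlogs-collecting service runs only on the current leader and
# is reconciled on every relevant Juju hook.
class BinlogsCollector:
    """Stand-in for the systemd/snap service managing binlog uploads."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


def reconcile_binlogs_service(
    collector: BinlogsCollector, is_leader: bool, collecting_enabled: bool
) -> None:
    """Converge the service to the desired state for this unit.

    Called from leader-elected, leader-settings-changed and
    update-status hooks, so a demoted leader stops collecting and a
    newly elected leader picks it up (bounded by the hook cadence).
    """
    if is_leader and collecting_enabled:
        collector.start()
    else:
        collector.stop()
```

Reconciling on update-status would also bound the window in which two units run the service after a switchover, though a brief overlap remains possible between hooks.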
Left some comments and I'll try to test it.
@@ -489,7 +528,12 @@ def _on_restore(self, event: ActionEvent) -> None:

```python
        return

backup_id = event.params["backup-id"].strip().strip("/")
logger.info(f"A restore with backup-id {backup_id} has been requested on unit")
restore_to_time = event.params.get("restore-to-time")
```
Please include validation (through regex?) for the `restore_to_time` input. Otherwise, a badly formatted input will only be caught when running the restore.
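A minimal sketch of what such validation could look like, combining a regex with a `datetime.strptime` check so both shape errors (e.g. missing seconds) and impossible dates are rejected up front. The helper name `is_valid_restore_to_time` is illustrative, not the charm's actual code:

```python
import re
from datetime import datetime

# Hypothetical validation helper; the function name and placement are
# illustrative, not the charm's actual implementation.
MYSQL_TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")


def is_valid_restore_to_time(value: str) -> bool:
    """Accept the keyword 'latest' or a full MySQL timestamp (seconds included)."""
    if value == "latest":
        return True
    # Regex rejects truncated inputs such as "2025-01-08 20:50".
    if not MYSQL_TIMESTAMP_RE.match(value):
        return False
    try:
        # strptime rejects well-shaped but impossible dates, e.g. month 13.
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False
```

With a check like this, `_on_restore` could call `event.fail(...)` immediately on a bad parameter instead of failing mid-restore.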
Just to reinforce this: failing to use the correct format will break recovery, e.g. by not setting seconds, `restore-to-time="2025-01-08 20:50"`, as I just did.
@@ -55,6 +55,9 @@ restore:

```yaml
  backup-id:
    type: string
    description: A backup-id to identify the backup to restore (format = %Y-%m-%dT%H:%M:%SZ)
  restore-to-time:
    type: string
    description: Point-in-time-recovery target in MySQL timestamp format.
```
For consistency's sake, include the format, as done for `backup-id`.
Important!

This PR relies on canonical/charmed-mysql-snap#56.

Overview

MySQL stores binary transaction logs. This PR adds a service job to upload these logs to the S3 bucket, and the ability to use them later for point-in-time recovery with a new `restore-to-time` parameter during restore. This new parameter accepts a MySQL timestamp or the keyword `latest` (to replay all the transaction logs).

Also, a new application blocked status is introduced, `Another cluster S3 repository`, to signal to the user that the configured S3 repository is claimed by another cluster; while it is shown, the binlogs-collecting job is disabled and creating new backups is restricted (these are the only workload limitations). This is crucial to keep the stored binary logs safe from other clusters. The check uses `@@GLOBAL.group_replication_group_name`.

After a restore, the cluster's group replication is reinitialized, so it practically becomes a new, different cluster. In this case, the `Another cluster S3 repository` message is changed to `Move restored cluster to another S3 repository` to indicate this event more clearly to the user.

Both block messages will disappear when the S3 configuration is removed or changed to an empty repository.
Usage example

```shell
juju run mysql/leader restore backup-id=2024-11-20T17:08:24Z restore-to-time="2024-11-20 17:10:01"
juju run mysql/leader restore backup-id=2024-11-20T17:08:24Z restore-to-time="latest"
```
Key notes