tablet should stay healthy while running xtrabackup #5066
deepthi merged 7 commits into vitessio:master
Conversation
…ents tablet from updating its replication lag Signed-off-by: deepthi <deepthi@planetscale.com>
```go
// Backup takes a db backup and sends it to the BackupStorage
func (agent *ActionAgent) Backup(ctx context.Context, concurrency int, logger logutil.Logger, allowMaster bool) error {
	if err := agent.lock(ctx); err != nil {
```
Just had a thought: Should we have a separate lock to allow only one backup at a time? I guess technically it might work to have multiple xtrabackups running, but it's probably not a good idea?
Would not locking allow it to be promoted to master? Is that ok?
+1 for cluster backup lock. It is unlikely you would want a new backup if one is already running, and limiting it helps ensure service only degrades so much.
Hm as far as the tablet is concerned, I think that would be acceptable. It's better than blocking promotion to master because xtrabackup is running; presumably you are promoting this one because the current master is in bad shape anyway. WDYT?
I don't know for sure if XtraBackup will still produce a consistent snapshot in that case, but I would expect it to.
I don't know a specific reason for it to not be promoted to master, just bringing it up for discussion. Would the tablet type still read BACKUP? My guess is that would prevent most workloads from choosing it to failover to, even though they probably should like you said.
We are not changing the tablet type to BACKUP while xtrabackup is running, which means a REPLICA would be eligible for master promotion.
That makes sense. Will there be any knowledge that a backup is running? I would imagine that when selecting a tablet for master promotion, you'd want to prefer a replica not currently running a backup. I would hope to get that info in wr.ShardReplicationStatuses or something like it.
There should be, but I don't think there is now. That's another thing we should fix.
I tried out this patch with a set of 250G shards that take 1.5hrs to back up, and the tablets stayed healthy.
…r online backup is running Signed-off-by: deepthi <deepthi@planetscale.com>
Testing update:
Signed-off-by: deepthi <deepthi@planetscale.com>
```go
		return err
	}
} else {
	if agent._isOnlineBackupRunning {
```
This should be done while holding the lock. The same applies while exporting the stats, and the same comments apply in Backup.
+1 but be careful to release the lock before returning. I recommend a pair of helpers like agent.beginOnlineBackup() and agent.endOnlineBackup() to handle the locking and stats update. Then you can use defer agent.mutex.Unlock() inside each helper, and I think you can also defer agent.endOnlineBackup() here so it's guaranteed for any return path.
I have called them beginBackup and endBackup. We set _isBackupRunning to true or false and then export the right value in the stats so that, regardless of online/offline, we can see the stats in /debug/vars or Prometheus metrics.
… to protect all access to _isBackupRunning and the stats variable Signed-off-by: deepthi <deepthi@planetscale.com>
Fixes #5062.
As documented in the issue, here's how we address this:
Signed-off-by: deepthi <deepthi@planetscale.com>