tablet should stay healthy while running xtrabackup by deepthi · Pull Request #5066 · vitessio/vitess

deepthi · 2019-08-09T00:33:59Z

Fixes #5062.
As documented in the issue, here's how we address this:

* don't take actionMutex lock while running xtrabackup
* introduce a boolean _isBackupRunning
* only allow a backup to start if _isBackupRunning is false
* protect updates to boolean using the mutex lock
* export this boolean to debug/vars so that there is a way for people to know that a backup is running

Signed-off-by: deepthi <deepthi@planetscale.com>

enisoc · 2019-08-09T01:30:28Z

go/vt/vttablet/tabletmanager/rpc_backup.go


 // Backup takes a db backup and sends it to the BackupStorage
 func (agent *ActionAgent) Backup(ctx context.Context, concurrency int, logger logutil.Logger, allowMaster bool) error {
-	if err := agent.lock(ctx); err != nil {


Just had a thought: Should we have a separate lock to allow only one backup at a time? I guess technically it might work to have multiple xtrabackups running, but it's probably not a good idea?

Would skipping the lock allow the tablet to be promoted to master while the backup is running? Is that ok?

+1 for cluster backup lock. It is unlikely you would want a new backup if one is already running, and limiting it helps ensure service only degrades so much.

Hm, as far as the tablet is concerned, I think that would be acceptable. It's better than blocking promotion to master because xtrabackup is running; presumably you are promoting this one because the current master is in bad shape anyway. WDYT?

I don't know for sure if XtraBackup will still produce a consistent snapshot in that case, but I would expect it to.

I don't know a specific reason for it to not be promoted to master, just bringing it up for discussion. Would the tablet type still read BACKUP? My guess is that would prevent most workloads from choosing it to fail over to, even though they probably should, like you said.

We are not changing the tablet type to BACKUP while xtrabackup is running, which means a REPLICA would be eligible for master promotion.

That makes sense. Will there be any knowledge that a backup is running? I would imagine that when selecting a tablet for master promotion, you'd want to prefer a replica not currently running a backup. I would hope to get that info in wr.ShardReplicationStatuses or something like it.

There should be, but I don't think there is now. Another thing we should fix.

enisoc · 2019-08-09T06:01:31Z

I tried out this patch with a set of 250G shards that take 1.5hrs to back up, and the tablets stayed healthy.

deepthi · 2019-08-09T16:16:02Z

I think we'll need to rework this. Apparently the actionMutex that agent.lock() acquires is the right lock to take, because it prevents two actions from running in parallel. This means that we need to find out why the healthcheck is blocked while actionMutex is held.

deepthi · 2019-08-14T02:12:42Z

Testing update:

* unit test for stats: it turns out we can't unit test the action_agent stats, because stats vars are global and the tests create more than one TestActionAgent
* test that tablet actually stays healthy: verified by @enisoc
* test that a call to Backup fails if a backup is already running: verified on local cluster
* sanity test of debug/vars: verified on local cluster
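
The first item above, about global stats vars, can be illustrated with Go's standard `expvar` package, which panics when the same name is published twice in one process. This is only an analogy: it assumes the tablet's stats variables are registered process-wide in a similar way, which the comment suggests but this sketch does not verify. `newTestAgent`, `tryNewTestAgent`, and the variable name are hypothetical.

```go
package main

import (
	"expvar"
	"fmt"
)

// newTestAgent is a hypothetical constructor that registers a global
// variable, the way a test helper might when it builds an agent.
func newTestAgent(statName string) *expvar.Int {
	return expvar.NewInt(statName) // panics if statName is already published
}

// tryNewTestAgent reports whether constructing the agent panicked.
func tryNewTestAgent(statName string) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	newTestAgent(statName)
	return false
}

func main() {
	fmt.Println(tryNewTestAgent("BackupIsRunningDemo")) // first registration
	fmt.Println(tryNewTestAgent("BackupIsRunningDemo")) // same name, same process
}
```

With a registry like this, a test suite that builds more than one agent per process has to either avoid registering the variable in tests or give each instance a unique name.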

deepthi · 2019-08-14T02:13:29Z

@enisoc @sougou this is ready for review. Let me know if you feel that we need an integ test for multiple backups running at one time.
cc @rafael

sougou

One nit. I'll let @enisoc do the full review.

sougou · 2019-08-14T03:32:06Z

go/vt/vttablet/tabletmanager/rpc_backup.go

 			return err
 		}
+	} else {
+		if agent._isOnlineBackupRunning {


This should be done while holding the lock. Same thing while exporting stats. And the same comments apply in Backup.

+1 but be careful to release the lock before returning. I recommend a pair of helpers like agent.beginOnlineBackup() and agent.endOnlineBackup() to handle the locking and stats update. Then you can use defer agent.mutex.Unlock() inside each helper, and I think you can also defer agent.endOnlineBackup() here so it's guaranteed for any return path.

I have called them beginBackup and endBackup. We set _isBackupRunning to true or false and then export the right value in the stats, so that regardless of online/offline, we can see the stats in /debug/vars or Prometheus metrics.

enisoc · 2019-08-14T16:42:38Z

enisoc

lgtm

don't take the action lock while running xtrabackup because that prevents tablet from updating its replication lag

89c4752

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi requested a review from sougou as a code owner August 9, 2019 00:33

deepthi requested a review from enisoc August 9, 2019 00:35

enisoc reviewed Aug 9, 2019

enisoc reviewed Aug 9, 2019

enisoc reviewed Aug 9, 2019

address review comments

e663c31

Signed-off-by: deepthi <deepthi@planetscale.com>

enisoc reviewed Aug 9, 2019

enisoc mentioned this pull request Aug 9, 2019

xtrabackup: Better support for large datasets #5065

Merged

deepthi added 2 commits August 9, 2019 21:23

implement boolean state for xtrabackup, and stats variable for whether online backup is running

f0b6b96

Signed-off-by: deepthi <deepthi@planetscale.com>

check whether backup is already running

9a2571b

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi changed the title ~~WIP: don't take the action lock while running xtrabackup~~ WIP: tablet should stay healthy while running xtrabackup Aug 10, 2019

Merge branch 'master' into ds-xb-5062

999613d

deepthi changed the title ~~WIP: tablet should stay healthy while running xtrabackup~~ tablet should stay healthy while running xtrabackup Aug 14, 2019

add doc for new stats var

43bb0cf

Signed-off-by: deepthi <deepthi@planetscale.com>

sougou reviewed Aug 14, 2019

enisoc requested changes Aug 14, 2019

export stats for both online and offline backups, use mutex correctly to protect all access to _isBackupRunning and the stats variable

c51e07f

Signed-off-by: deepthi <deepthi@planetscale.com>

enisoc approved these changes Aug 15, 2019

deepthi merged commit 9830594 into vitessio:master Aug 15, 2019

enisoc deleted the ds-xb-5062 branch August 15, 2019 02:20

deepthi mentioned this pull request Aug 23, 2019

healthcheck after backup should be run only for offline backups #5129

Merged

arka-g mentioned this pull request Sep 10, 2019

Slack vitess 2019.08.26.r0 tinyspeck/vitess#137

Merged


5 participants
