Skip to content

tablet should stay healthy while running xtrabackup#5066

Merged
deepthi merged 7 commits intovitessio:masterfrom
planetscale:ds-xb-5062
Aug 15, 2019
Merged

tablet should stay healthy while running xtrabackup#5066
deepthi merged 7 commits intovitessio:masterfrom
planetscale:ds-xb-5062

Conversation

@deepthi
Copy link
Copy Markdown
Collaborator

@deepthi deepthi commented Aug 9, 2019

Fixes #5062.
As documented in the issue, here's how we address this:

  • don't take actionMutex lock while running xtrabackup
  • introduce a boolean _isBackupRunning
  • only allow a backup to start if _isBackupRunning is false
  • protect updates to boolean using the mutex lock
  • export this boolean to debug/vars so that there is a way for people to know that a backup is running

Signed-off-by: deepthi deepthi@planetscale.com

…ents tablet from updating its replication lag

Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi requested a review from sougou as a code owner August 9, 2019 00:33
@deepthi deepthi requested a review from enisoc August 9, 2019 00:35
Signed-off-by: deepthi <deepthi@planetscale.com>

// Backup takes a db backup and sends it to the BackupStorage
func (agent *ActionAgent) Backup(ctx context.Context, concurrency int, logger logutil.Logger, allowMaster bool) error {
if err := agent.lock(ctx); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a thought: Should we have a separate lock to allow only one backup at a time? I guess technically it might work to have multiple xtrabackups running, but it's probably not a good idea?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not locking allow it to be promoted to master? Is that ok?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for cluster backup lock. It is unlikely you would want a new backup if one is already running, and limiting it helps ensure service only degrades so much.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm as far as the tablet is concerned, I think that would be acceptable. It's better than blocking promotion to master because xtrabackup is running; presumably you are promoting this one because the current master is in bad shape anyway. WDYT?

I don't know for sure if XtraBackup will still produce a consistent snapshot in that case, but I would expect it to.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know a specific reason for it to not be promoted to master, just bringing it up for discussion. Would the tablet type still read BACKUP? My guess is that would prevent most workloads from choosing it to failover to, even though they probably should like you said.

Copy link
Copy Markdown
Collaborator Author

@deepthi deepthi Aug 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not changing the tablet type to BACKUP while xtrabackup is running, which means a REPLICA would be eligible for master promotion.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Will there be any knowledge that a backup is running? I would imagine that when selecting a tablet for master promotion, you'd want to prefer a replica not currently running a backup. I would hope to get that info in wr.ShardReplicationStatuses or something like it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be, but I don't think there is now. Another thing we should fix.

@enisoc
Copy link
Copy Markdown
Member

enisoc commented Aug 9, 2019

I tried out this patch with a set of 250G shards that take 1.5hrs to back up, and the tablets stayed healthy.

@deepthi
Copy link
Copy Markdown
Collaborator Author

deepthi commented Aug 9, 2019

I think we'll need to rework this. Apparently the actionMutex that agent.lock() acquires a lock on is the right one, because it prevents two actions from running in parallel. This means that we need to find out why healthcheck is blocked when the actionMutex is locked.

deepthi added 2 commits August 9, 2019 21:23
…r online backup is running

Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi changed the title WIP: don't take the action lock while running xtrabackup WIP: tablet should stay healthy while running xtrabackup Aug 10, 2019
@deepthi
Copy link
Copy Markdown
Collaborator Author

deepthi commented Aug 14, 2019

Testing update:

  • unit test for stats - turns out there is a reason we can't unit test action_agent stats. It is because stats vars are global and we create >1 TestActionAgent in the tests.
  • test that tablet actually stays healthy - verified by @enisoc
  • test that a call to Backup fails if a backup is already running - verified on local cluster
  • sanity test of debug/vars - verified on local cluster

@deepthi deepthi changed the title WIP: tablet should stay healthy while running xtrabackup tablet should stay healthy while running xtrabackup Aug 14, 2019
@deepthi
Copy link
Copy Markdown
Collaborator Author

deepthi commented Aug 14, 2019

@enisoc @sougou this is ready for review. Let me know if you feel that we need an integ test for multiple backups running at one time.
cc @rafael

Signed-off-by: deepthi <deepthi@planetscale.com>
Copy link
Copy Markdown
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit. I'll let @enisoc do the full review.

return err
}
} else {
if agent._isOnlineBackupRunning {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done while holding the lock. Same thing while exporting stats. And same comments in Backup

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 but be careful to release the lock before returning. I recommend a pair of helpers like agent.beginOnlineBackup() and agent.endOnlineBackup() to handle the locking and stats update. Then you can use defer agent.mutex.Unlock() inside each helper, and I think you can also defer agent.endOnlineBackup() here so it's guaranteed for any return path.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have called them beginBackup and endBackup. we set _isBackupRunning to true or false and then export the right value in the stats so that regardless of online/offline, we can see the stats in /debug/vars or prometheus metrics.

return err
}
} else {
if agent._isOnlineBackupRunning {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 but be careful to release the lock before returning. I recommend a pair of helpers like agent.beginOnlineBackup() and agent.endOnlineBackup() to handle the locking and stats update. Then you can use defer agent.mutex.Unlock() inside each helper, and I think you can also defer agent.endOnlineBackup() here so it's guaranteed for any return path.

… to protect all access to _isBackupRunning and the stats variable

Signed-off-by: deepthi <deepthi@planetscale.com>
Copy link
Copy Markdown
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

XtraBackup: Tablet goes unhealthy during backup

5 participants