check replication lag on state change before starting query service#5000
check replication lag on state change before starting query service#5000deepthi merged 4 commits intovitessio:masterfrom
Conversation
setassociative
left a comment
There was a problem hiding this comment.
This looks good -- interested in what your approach to validating this was with live data and if you had any problems with the tablet getting stuck in a non-serving mode (iirc I was never able to figure out why that happened to me).
|
@deepthi - This looks great. I think there are some Before we merge. Could you verify manually that this works? Run a manual integration test? The way we reproduce this is:
|
|
@deepthi - I was doing some of our internal testing I think the bug is still present in this branch. I think I was able to create instructions that should help you reproduce in any environment. Using vtbench:
If lag is beyond the serving threshold, you will see vtbench failures.
|
0547ddc to
35be115
Compare
|
The latest changes do ensure that replica does not go into serving if it is lagged more than unhealthyThreshold after completing a backup.
There are a few oddities to be noted:
|
There was a problem hiding this comment.
This is a new failure mode that we need to be aware, but I think it makes sense to fail if we can't get master position.
There was a problem hiding this comment.
Agreed. Also the same type of tablet/mysql loss is something that needs to be handled already by an operator as vttablet/mysqld could go down and fall behind the ability to catch up for several reasons already, e.g., network partition. This failure mode should be well covered by normal tooling.
The thing I'm more interested in actually is going to be
when trying to change tablet type and state after completing a backup, if the tablet is lagged, it stays in BACKUP type until it is caught up and only then transitions into REPLICA type. This seems to be an artifact of how state_change has been implemented in the code.
There was a problem hiding this comment.
i don't think it's necessary to extract all the params into separate vars, unless you need a few of the vars to overwrite, and then i'd only make those.
There was a problem hiding this comment.
I did this for 2 reasons:
- it serves as documentation of what params we actually use from the object in this function
- it avoided making a whole lot of name changes through out the functions
There was a problem hiding this comment.
i think the docs belong in the params struct docs. if it isn't used, then there was not point of putting it in the struct.
if we never change the lines, then the code will get progressively more cluttered.
There was a problem hiding this comment.
that params struct is used to pass the params down two levels of calls. Not all params are used at both levels so it is a union of the two sets of params. Do you still feel like we should not extract the params into vars?
There was a problem hiding this comment.
$0.02: I find that the backup param makes ExecuteBackup & RestoreBackup much easier to read.
If the primary reason for extracting args is to minimize churn in this review I would suggest a follow up where we move to dereferencing params instead of extracting them like this. That would leave the interface cleaner / more easy to scan and remove this noisy block of param extraction.
There was a problem hiding this comment.
$0.02: I find that the backup param makes ExecuteBackup & RestoreBackup much easier to read.
If the primary reason for extracting args is to minimize churn in this review I would suggest a follow up where we move to dereferencing params instead of extracting them like this. That would leave the interface cleaner / more easy to scan and remove this noisy block of param extraction.
go/vt/mysqlctl/backupengine.go
Outdated
There was a problem hiding this comment.
Because your params structs are used as an abstraction for "control of some action this mysqlctl API takes" I would suggest dropping BackupHandle as a user facing value. It seems that anything outside of mysqlctl.Backup|Restore is going to have the value they set overridden so we might as well not complicate matters by allowing it to be set before calling the Backup|Restore func.
No strong opinions on if this is done by making private or passing it as a separate param outside this abstraction in to the ultimate ExecuteBackup/FindBackupToRestore
There was a problem hiding this comment.
good point. I'll change this.
go/vt/mysqlctl/backupengine.go
Outdated
There was a problem hiding this comment.
For BackupParams and RestoreParams docs on the non-obvious args would be ✨. (To me the non-obvious one is mostly HookExtraEnv though I'm just assuming I understand DbName, LocalMetadata, and what the Keyspace/Shard pair is for).
There was a problem hiding this comment.
doc nit: "Wait for a reliable 'seconds behind master' value" or something. My initial read of this was that it was going to wait for a number of seconds that we considered reliable to have us get a new seconds behind master reading.
There was a problem hiding this comment.
I'll fix the comment
There was a problem hiding this comment.
Agreed. Also the same type of tablet/mysql loss is something that needs to be handled already by an operator as vttablet/mysqld could go down and fall behind the ability to catch up for several reasons already, e.g., network partition. This failure mode should be well covered by normal tooling.
The thing I'm more interested in actually is going to be
when trying to change tablet type and state after completing a backup, if the tablet is lagged, it stays in BACKUP type until it is caught up and only then transitions into REPLICA type. This seems to be an artifact of how state_change has been implemented in the code.
|
@setassociative is this a concern? Can you articulate what you see as potential problems with it?
|
|
@deepthi With respect to the BACKUP mode thing: sorry, I don't have concerns and should have been more clear. I was calling it out simply because that felt the more "new" change and thus a little more interesting. |
Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
…dMaster during backup/restore. separate disallowQueryService from disallowQueryReason. disallowQueryReason was being used to permanently disable query service, but for lagging tablets we want to disable it temporarily. change ExecuteBackup and ExecuteRestore to accept BackupParams and RestoreParams instead of a long list of arguments. Signed-off-by: deepthi <deepthi@planetscale.com>
… minor edits Signed-off-by: deepthi <deepthi@planetscale.com>
sougou
left a comment
There was a problem hiding this comment.
The more stable fix will be to encapsulate this functionality within the module that reports replica lag so all users can benefit from this, which includes other healthcheck workflows.
We can do that as a different PR.

Fixes #4426
SHOW SLAVE STATUSgets set to 0 when replication is stopped and restarted. Implemented logic in backup/restore to ensure that either replication has caught up to master, or has progressed from last known position before this can be trusted to compute replication delay.Signed-off-by: deepthi deepthi@planetscale.com