
check replication lag on state change before starting query service #5000

Merged
deepthi merged 4 commits into vitessio:master from planetscale:ds-4426
Sep 18, 2019

Conversation

@deepthi
Collaborator

@deepthi deepthi commented Jul 15, 2019

Fixes #4426

  • check replication delay on replica during state_change before setting it to SERVING
  • restored tablets should always start as NOT_SERVING
  • SecondsBehindMaster from SHOW SLAVE STATUS gets set to 0 when replication is stopped and restarted. Implemented logic in backup/restore to ensure that either replication has caught up to master, or has progressed from last known position before this can be trusted to compute replication delay.

Signed-off-by: deepthi <deepthi@planetscale.com>
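A minimal Go sketch of the trust check described in the last bullet above, under loud assumptions: positions are modeled as plain increasing integers (real replication positions are GTID sets), and the function name `lagIsTrustworthy` and its parameters are hypothetical, not taken from this PR's code.

```go
package main

import "fmt"

// lagIsTrustworthy reports whether a SecondsBehindMaster reading can be used
// to compute replication delay after replication was stopped and restarted.
// All three positions are hypothetical integer stand-ins for GTID positions:
//   currentPos   - what the replica has applied so far
//   lastKnownPos - where the replica was when replication was restarted
//   masterPos    - the master position captured at restart time
func lagIsTrustworthy(currentPos, lastKnownPos, masterPos int64) bool {
	caughtUp := currentPos >= masterPos    // replica reached the master position
	progressed := currentPos > lastKnownPos // replica applied at least one new event
	return caughtUp || progressed
}

func main() {
	// Replication restarted but nothing applied yet: a reading of 0 is not reliable.
	fmt.Println(lagIsTrustworthy(100, 100, 250))
	// Replica has moved past its last known position: the reading can be used.
	fmt.Println(lagIsTrustworthy(120, 100, 250))
}
```

In this sketch a caller would keep polling until the function returns true, and only then use the reported lag to decide whether to serve.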

@deepthi deepthi requested a review from sougou as a code owner July 15, 2019 20:14
@setassociative setassociative left a comment

This looks good -- interested in what your approach to validating this was with live data and if you had any problems with the tablet getting stuck in a non-serving mode (iirc I was never able to figure out why that happened to me).

@deepthi deepthi requested review from dweitzman and rafael July 23, 2019 00:18
@deepthi deepthi changed the title from "WIP: check replication lag on state change before starting query service" to "check replication lag on state change before starting query service" Jul 23, 2019
@rafael
Member

rafael commented Jul 25, 2019

@deepthi - This looks great. I think there are some Println that we need to remove from the tests.

Before we merge, could you verify manually that this works by running a manual integration test? The way we reproduce this is:

  1. Have a script that writes tons of data and creates lag.
  2. Take a backup while this script is running.
  3. Have a script that reads from the replica.
  4. Notice in the tablet UI that when the tablet comes back from backup, it no longer serves any queries.

@sougou
Contributor

sougou commented Jul 27, 2019

This change looks good. @deepthi I can merge this once you've satisfied @rafael's request.

@deepthi
Collaborator Author

deepthi commented Jul 29, 2019

This change looks good. @deepthi I can merge this once you've satisfied @rafael's request.

In runHealthCheckLocked, when we set _replicationDelay we also set _healthy and _healthyTime. Should these fields also be set here?

@rafael
Member

rafael commented Aug 14, 2019

@deepthi - I was doing some of our internal testing and I think the bug is still present in this branch. I think I was able to put together instructions that should help you reproduce it in any environment.

Using vtbench:

  1. Create a credentials file with this format:
# vtbench_mysql_creds.json
{
   "vtgate_user":[
      "vt_pass"
   ]
}
  2. Create a table with the following schema:
CREATE TABLE `test_table` (
  `i` int(11) DEFAULT NULL,
  `c` char(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
  3. Use vtbench to insert data into the table:
vtbench -host localhost -protocol mysql -port 15306 -user vtgate_user -db-credentials-file ./vtbench_mysql_creds.json -db @master --count 300000 --threads 25 -sql "INSERT INTO test_table (i,c) VALUES(1,'record one') /* vt_bench:thread */"
  4. This should create lag pretty quickly in the replicas.

  5. Once replicas are lagging, run the following:

while true; do vtbench -host localhost -protocol mysql -port 15306 -user vtgate_user -db-credentials-file ./vtbench_mysql_creds.json -db @replica --count 30000 --threads 25 -sql 'select count(*) from test_table'; done

If lag is beyond the serving threshold, you will see vtbench failures.

  6. Trigger a backup on one of the replicas.

  7. You will notice that when the replica comes back from backup, some queries slip in and get served:

[Screenshot: Screen Shot 2019-08-13 at 5 32 37 PM]

@deepthi
Collaborator Author

deepthi commented Sep 12, 2019

The latest changes ensure that a replica does not go into serving after completing a backup if it is lagging by more than unhealthyThreshold.
Scenarios tested:

  • tablet is healthy before backup, comes back to serving after backup
  • tablet is degraded before backup, still degraded after backup, comes back to serving
  • tablet is degraded before backup, unhealthy after backup, goes to non_serving
  • tablet is unhealthy before backup, stays non_serving after backup
  • when tablet is restored from backup, it goes non_serving if the backup is far behind current master, and goes serving only after catching up

There are a few oddities to be noted:

  • when trying to change tablet type and state after completing a backup, if the tablet is lagged, it stays in BACKUP type until it is caught up and only then transitions into REPLICA type. This seems to be an artifact of how state_change has been implemented in the code.
  • in my testing, I stop writing to the master after a while and wait for the replica to catch up (post-backup). The lag keeps growing even as the gap between Executed_Gtid_Set and Retrieved_Gtid_Set narrows. I suspect this is because there are no new transactions on the master (which would trigger a proper re-evaluation of SecondsBehindMaster). Only when Executed_Gtid_Set catches up to master does SecondsBehindMaster go to 0, so instead of dropping gradually, it suddenly goes to 0. This is probably not going to happen in a real system where there is continuous traffic to master.

Member

@rafael rafael left a comment

@deepthi - Nice work! I think this approach should be good for what we need. Added minor comments.

Member

This is a new failure mode that we need to be aware of, but I think it makes sense to fail if we can't get master position.

Agreed. Also the same type of tablet/mysql loss is something that needs to be handled already by an operator as vttablet/mysqld could go down and fall behind the ability to catch up for several reasons already, e.g., network partition. This failure mode should be well covered by normal tooling.

The thing I'm more interested in actually is going to be

when trying to change tablet type and state after completing a backup, if the tablet is lagged, it stays in BACKUP type until it is caught up and only then transitions into REPLICA type. This seems to be an artifact of how state_change has been implemented in the code.

Contributor

i don't think it's necessary to extract all the params into separate vars, unless you need a few of the vars to overwrite, and then i'd only make those.

Collaborator Author

I did this for 2 reasons:

  • it serves as documentation of what params we actually use from the object in this function
  • it avoided making a whole lot of name changes throughout the functions

Contributor

i think the docs belong in the params struct docs. if it isn't used, then there was no point in putting it in the struct.

if we never change the lines, then the code will get progressively more cluttered.

Collaborator Author

that params struct is used to pass the params down two levels of calls. Not all params are used at both levels so it is a union of the two sets of params. Do you still feel like we should not extract the params into vars?

$0.02: I find that the backup param makes ExecuteBackup & RestoreBackup much easier to read.

If the primary reason for extracting args is to minimize churn in this review I would suggest a follow up where we move to dereferencing params instead of extracting them like this. That would leave the interface cleaner / more easy to scan and remove this noisy block of param extraction.
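To make the two styles under discussion concrete, here is a small Go sketch. It is an illustration only: the struct below is a trimmed-down, hypothetical stand-in whose field names (Keyspace, Shard, HookExtraEnv) come from this review thread, while the field types and both helper functions are assumptions and not the code in this PR.

```go
package main

import "fmt"

// BackupParams is a hypothetical, reduced params struct. Field names are
// borrowed from the review discussion; the types are assumed.
type BackupParams struct {
	// Keyspace and Shard identify what is being backed up.
	Keyspace string
	Shard    string
	// HookExtraEnv holds extra environment variables passed to hooks.
	HookExtraEnv map[string]string
}

// backupDirExtracted copies the fields it needs into locals up front,
// which documents at a glance what the function uses.
func backupDirExtracted(params BackupParams) string {
	keyspace := params.Keyspace
	shard := params.Shard
	return fmt.Sprintf("%s/%s", keyspace, shard)
}

// backupDirDirect reads fields from params at the point of use,
// which avoids the extraction block.
func backupDirDirect(params BackupParams) string {
	return fmt.Sprintf("%s/%s", params.Keyspace, params.Shard)
}

func main() {
	p := BackupParams{Keyspace: "commerce", Shard: "0"}
	fmt.Println(backupDirExtracted(p))
	fmt.Println(backupDirDirect(p))
}
```

Both helpers behave identically; the trade-off is between a self-documenting extraction block and less clutter in the function body.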

Because your params structs are used as an abstraction for "control of some action this mysqlctl API takes" I would suggest dropping BackupHandle as a user facing value. It seems that anything outside of mysqlctl.Backup|Restore is going to have the value they set overridden so we might as well not complicate matters by allowing it to be set before calling the Backup|Restore func.

No strong opinions on whether this is done by making it private or by passing it as a separate param outside this abstraction into the ultimate ExecuteBackup/FindBackupToRestore.

Collaborator Author

good point. I'll change this.

For BackupParams and RestoreParams docs on the non-obvious args would be ✨. (To me the non-obvious one is mostly HookExtraEnv though I'm just assuming I understand DbName, LocalMetadata, and what the Keyspace/Shard pair is for).

Collaborator Author

will do

doc nit: "Wait for a reliable 'seconds behind master' value" or something. My initial read of this was that it was going to wait for a number of seconds that we considered reliable to have us get a new seconds behind master reading.

Collaborator Author

I'll fix the comment

@deepthi
Collaborator Author

deepthi commented Sep 14, 2019

@setassociative is this a concern? Can you articulate what you see as potential problems with it?

when trying to change tablet type and state after completing a backup, if the tablet is lagged, it stays in BACKUP type until it is caught up and only then transitions into REPLICA type. This seems to be an artifact of how state_change has been implemented in the code.

@setassociative

@deepthi With respect to the BACKUP mode thing: sorry, I don't have concerns and should have been more clear. I was calling it out simply because that felt like the more "new" change and thus a little more interesting.

Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
…dMaster during backup/restore.

separate disallowQueryService from disallowQueryReason.
disallowQueryReason was being used to permanently disable query service,
but for lagging tablets we want to disable it temporarily.
change ExecuteBackup and ExecuteRestore to accept BackupParams and
RestoreParams instead of a long list of arguments.

Signed-off-by: deepthi <deepthi@planetscale.com>
… minor edits

Signed-off-by: deepthi <deepthi@planetscale.com>
Contributor

@sougou sougou left a comment

The more stable fix will be to encapsulate this functionality within the module that reports replica lag so all users can benefit from this, which includes other healthcheck workflows.

We can do that as a different PR.

@deepthi deepthi merged commit 3a21af2 into vitessio:master Sep 18, 2019

Successfully merging this pull request may close these issues.

Tablets serving stale data after backup

5 participants