
Consider repl NOT healthy with IO_Thread connecting and last attempt failed#9789

Closed
mattlord wants to merge 9 commits into vitessio:main from planetscale:repl_io_thread_conn_with_err

Conversation

@mattlord
Member

@mattlord mattlord commented Feb 25, 2022

Description

We have been treating the Slave_IO_Running state of Connecting as equivalent to Running in the replication status results that we get from MySQL. I'm assuming this was done to avoid flapping on low-traffic systems due to -slave_net_timeout reconnects, and to avoid doing a tablet repair when replication was actually healthy (with this PR we can distinguish running from healthy states and take different actions depending on what we want).

After #9308 we properly estimate the replica lag when MySQL tells us that it does not know, meaning it returns a NULL value for the Seconds_Behind_Master field. In some cases, however (it appears to happen when attempting the first connection to the source), MySQL will report a Seconds_Behind_Master value of 0 (meaning fully caught up, no lag) even when it is not connected to its replication source and has failed to reconnect. That is, this is not a simple reconnect for any reason (e.g. -slave_net_timeout), but a reconnect with one or more failures/errors. This PR handles that latter case within Vitess by continuing to treat Connecting as equivalent to Running via ReplicationStatus.Healthy(), to prevent the noted flapping and errant/unnecessary tablet repairs, unless we had an IO error the last time we tried to reconnect to the replication source.

Note: I think this is also what effectively caused the bug seen in #9788, as the new replica tablet was considered healthy when in fact it had not been able to connect to its source (for the very first time).
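The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual Vitess implementation: the field and method names are simplified stand-ins for the real ReplicationStatus type.

```go
package main

import "fmt"

// ReplicationStatus is a hypothetical, simplified stand-in for the Vitess type.
type ReplicationStatus struct {
	IOThreadRunning    bool   // Slave_IO_Running = Yes
	IOThreadConnecting bool   // Slave_IO_Running = Connecting
	SQLThreadRunning   bool   // Slave_SQL_Running = Yes
	LastIOError        string // Last_IO_Error from the replication status output
}

// Healthy treats Connecting as equivalent to Running, to avoid flapping on
// low-traffic systems, UNLESS the last reconnect attempt to the source failed.
func (s ReplicationStatus) Healthy() bool {
	ioHealthy := s.IOThreadRunning || (s.IOThreadConnecting && s.LastIOError == "")
	return ioHealthy && s.SQLThreadRunning
}

func main() {
	reconnecting := ReplicationStatus{IOThreadConnecting: true, SQLThreadRunning: true}
	failed := ReplicationStatus{IOThreadConnecting: true, SQLThreadRunning: true,
		LastIOError: "error reconnecting to source"}
	fmt.Println(reconnecting.Healthy(), failed.Healthy()) // true false
}
```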

Related Issue(s)

Checklist

  • Should this PR be backported? NO
  • Tests are not required (I don't think)
  • Documentation is not required

@shlomi-noach
Contributor

I have a vague memory of having this discussion elsewhere in the code. I also ran into a similar issue in orchestrator. IMO the correct way to handle this is to avoid using a bool for Slave_IO_Running. The problem with keeping the behavior as it was before this PR is clear, but this PR also creates confusion: it will say the IO thread is not running when it could be in the middle of completing a connection. An example of the risk with this approach: you check replication status and get false for IO thread running; you assume replication is stopped; you run some logic that relies on this assumption; but unknown to you, the IO thread finally succeeds in completing a connection and all of a sudden replication is running.

So IMO the state of IO thread should be an enum like stopped, connecting, running.

@mattlord
Member Author

I have a vague memory of having this discussion elsewhere in the code. I also ran into a similar issue in orchestrator. IMO the correct way to handle this is to avoid using a bool for Slave_IO_Running. The problem with keeping the behavior as it was before this PR is clear, but this PR also creates confusion: it will say the IO thread is not running when it could be in the middle of completing a connection. An example of the risk with this approach: you check replication status and get false for IO thread running; you assume replication is stopped; you run some logic that relies on this assumption; but unknown to you, the IO thread finally succeeds in completing a connection and all of a sudden replication is running.

So IMO the state of IO thread should be an enum like stopped, connecting, running.

I like this idea. It's more accurate (the SQL_Thread really does only have a binary state whereas the IO_Thread has 3 states) and it allows the caller to decide whether or not to consider connecting and running equivalent or perform different actions. I'll work on this now.

Thank you!
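The suggested tri-state could be sketched like this. The names here are illustrative; the enum Vitess actually adopted may differ.

```go
package main

import "fmt"

// ReplicationState models the three possible IO-thread states, instead of a
// bool that conflates "connecting" with either "running" or "stopped".
type ReplicationState int

const (
	ReplicationStateStopped ReplicationState = iota
	ReplicationStateConnecting
	ReplicationStateRunning
)

// ReplicationStateFromStatus maps the Slave_IO_Running column value from
// MySQL's replication status output to the enum.
func ReplicationStateFromStatus(val string) ReplicationState {
	switch val {
	case "Yes":
		return ReplicationStateRunning
	case "Connecting":
		return ReplicationStateConnecting
	default: // "No"
		return ReplicationStateStopped
	}
}

func main() {
	// The caller can now decide whether Connecting counts as healthy,
	// rather than having that decision baked into a bool.
	fmt.Println(ReplicationStateFromStatus("Connecting") == ReplicationStateConnecting) // true
}
```

Note that the SQL thread keeps its binary state; only the IO thread needs the third value.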

@mattlord mattlord requested a review from deepthi as a code owner February 28, 2022 22:28
@mattlord mattlord force-pushed the repl_io_thread_conn_with_err branch 3 times, most recently from 0f94c44 to d1300d1 on March 1, 2022 04:31
@mattlord mattlord requested a review from shlomi-noach March 1, 2022 06:34
@mattlord
Member Author

mattlord commented Mar 1, 2022

@shlomi-noach I applied all of your great suggestions and the tests are now passing. If you have time for another quick review pass (not urgent), I'd most appreciate it.

Thank you again!

Contributor

@shlomi-noach shlomi-noach left a comment


🚀

@mattlord
Member Author

mattlord commented Mar 7, 2022

@shlomi-noach I pushed some more commits, if you have a moment.

I realized that I was altering some behavior when I didn't intend to. I also added some helpers and shortened function names when the context was there twice (ReplicationStatus.Replication*).

In the process I noticed and fixed a bug in replication tablet startup (WaitForReplicationStart) too.

Member Author

@mattlord mattlord Mar 7, 2022


Note: this was a bug. It should be using Running() and NOT Healthy() because it's supposed to wait until we know that we can successfully connect to the source (we were NOT doing that before).

@mattlord mattlord changed the title Consider repl NOT running with IO_Thread connecting and last attempt failed Consider repl NOT healthy with IO_Thread connecting and last attempt failed Mar 7, 2022
@mattlord mattlord requested review from GuptaManan100 and removed request for deepthi March 7, 2022 18:03
Contributor

@GuptaManan100 GuptaManan100 left a comment


Everything looks good to me! I just have 1 question. Is there any way to cleanup the old fields in Proto. I know they cannot be removed immediately for upgrade-downgrade considerations, but can they be marked for deprecation or something like that?

Contributor

@shlomi-noach shlomi-noach left a comment


LGTM. I don't have full context on the problem, and am assuming that the existing tests cover the behavior exhibited by this change.

@mattlord
Member Author

mattlord commented Mar 8, 2022

Everything looks good to me! I just have 1 question. Is there any way to cleanup the old fields in Proto. I know they cannot be removed immediately for upgrade-downgrade considerations, but can they be marked for deprecation or something like that?

Good question! I was wondering that too, but I don't know the answer. @deepthi do we have a defined process or an example to follow regarding making backwards compatibility breaking protobuf changes?

It would be nice to e.g. remove the io_thread_running and io_thread_connecting booleans and replace them with an io_state int field.

@deepthi
Collaborator

deepthi commented Mar 9, 2022

Everything looks good to me! I just have 1 question. Is there any way to cleanup the old fields in Proto. I know they cannot be removed immediately for upgrade-downgrade considerations, but can they be marked for deprecation or something like that?

Good question! I was wondering that too, but I don't know the answer. @deepthi do we have a defined process or an example to follow regarding making backwards compatibility breaking protobuf changes?

It would be nice to e.g. remove the io_thread_running and io_thread_connecting booleans and replace them with an io_state int field.

The way to make such changes to proto definitions is to use reserved: you replace the old field with something like reserved 2; and add the new field. I realize that we have a couple of fields with deprecated=true; those slipped through reviews, and we should never have done that.
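As a sketch of what that looks like (the message name and field numbers here are illustrative, not the actual Vitess proto):

```proto
message Status {
  // The old boolean previously lived at field 2; reserving the number
  // (and the name) prevents either from being accidentally reused.
  reserved 2;
  reserved "io_thread_running";

  // Hypothetical replacement: an int-backed state field.
  int32 io_state = 12;
}
```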

@mattlord
Member Author

mattlord commented Mar 9, 2022

Everything looks good to me! I just have 1 question. Is there any way to cleanup the old fields in Proto. I know they cannot be removed immediately for upgrade-downgrade considerations, but can they be marked for deprecation or something like that?

Good question! I was wondering that too, but I don't know the answer. @deepthi do we have a defined process or an example to follow regarding making backwards compatibility breaking protobuf changes?
It would be nice to e.g. remove the io_thread_running and io_thread_connecting booleans and replace them with an io_state int field.

The way to make such changes to proto definitions is to use reserved: you replace the old field with something like reserved 2; and add the new field. I realize that we have a couple of fields with deprecated=true; those slipped through reviews, and we should never have done that.

OK, thanks! I'll investigate whether there's a way I can safely deprecate the io_thread_running and sql_thread_connecting booleans and effectively replace them with io_state and sql_state ints (and get rid of the io_thread_connecting boolean added in this PR). I was assuming that would be forbidden, but perhaps I was wrong. If we can do that, it would keep the proto-to-struct translation clean and simple.

@mattlord mattlord requested review from deepthi and removed request for ajm188, doeg, notfelineit and rohit-nayak-ps March 9, 2022 02:35
mattlord added 8 commits March 9, 2022 10:31
…failed

Signed-off-by: Matt Lord <mattalord@gmail.com>
…ationStatus

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
And make function names nicer (removing double context)

And add some helper functions

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord
Member Author

mattlord commented Mar 9, 2022

I apologize, but I'm going to rebase via FF merge from vitessio/main and force-push as GitHub Actions is somehow checking out an invalid snapshot and I'm getting CI errors for code that doesn't exist in my branch's HEAD. I'll share another comment with the diff covering the changes since the last reviews once the tests are passing again.

@mattlord mattlord force-pushed the repl_io_thread_conn_with_err branch from e88bc8b to 974633f on March 9, 2022 15:34
@mattlord
Member Author

mattlord commented Mar 9, 2022

Ugh, still happening. You can see an example here with errors about code that doesn't exist in my branch: https://github.com/vitessio/vitess/runs/5482587430?check_suite_focus=true

The issue is that GH Actions merges main from the merge target (vitessio/main) into PRs**, and main had some incompatible additions to go/vt/vtctl/reparentutil/emergency_reparenter_test.go in 66d7b9a. Since those changes landed after this PR branch was created, they get merged in the Actions checkout stage and then things break. I addressed this in a local branch and had to force-push to the PR's remote.

**

/usr/bin/git checkout --progress --force refs/remotes/pull/9789/merge
  Note: switching to 'refs/remotes/pull/9789/merge'.
  
HEAD is now at ab2e07e Merge 974633fa50865bc7d39677c059eb8f570c474f49 into 8c92a04efb89b3b7a74a1db7f366ff5707114cc3

Update vtadmin web protobufs

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the repl_io_thread_conn_with_err branch from 974633f to d4099fe on March 9, 2022 16:47
@mattlord
Member Author

mattlord commented Mar 9, 2022

Force-pushing from a fixed local branch to this PR's remote didn't seem to do it either. 😢 Moving this to a new PR: #9853

@mattlord mattlord closed this Mar 9, 2022
@mattlord mattlord deleted the repl_io_thread_conn_with_err branch March 9, 2022 17:04


Development

Successfully merging this pull request may close these issues.

Bug Report: vttablet replica is considered healthy (serving) even when not connected to primary

4 participants