Consider repl NOT healthy with IO_Thread connecting and last attempt failed #9789

mattlord wants to merge 9 commits into vitessio:main
Conversation
I have a vague memory of having this discussion elsewhere in the code. Also, I had a similar one in … So IMO the state of the IO thread should be an enum like …
I like this idea. It's more accurate (the SQL_Thread really does only have a binary state, whereas the IO_Thread has 3 states) and it allows the caller to decide whether or not to consider connecting and running equivalent, or to perform different actions. I'll work on this now. Thank you!
Force-pushed 0f94c44 to d1300d1
@shlomi-noach I applied all of your great suggestions and the tests are now passing. If you have time for another quick review pass (not urgent), I'd most appreciate it. Thank you again!
@shlomi-noach I pushed some more commits, if you have a moment. I realized that I was altering some behavior when I didn't intend to. I also added some helpers and shortened function names where the context was there twice (ReplicationStatus.Replication*). In the process I noticed and fixed a bug in replication tablet startup (…).
go/vt/mysqlctl/replication.go
Outdated
Note: this was a bug. It should be using Running() and NOT Healthy() because it's supposed to wait until we know that we can successfully connect to the source (we were NOT doing that before).
GuptaManan100 left a comment:
Everything looks good to me! I just have one question: is there any way to clean up the old fields in the proto? I know they cannot be removed immediately for upgrade-downgrade considerations, but can they be marked as deprecated or something like that?
shlomi-noach left a comment:
LGTM. I don't have full context into the problem, and am assuming that the existing tests cover the behavior exhibited by this change.
Good question! I was wondering that too, but I don't know the answer. @deepthi do we have a defined process or an example to follow regarding making backwards-compatibility-breaking protobuf changes? It would be nice to e.g. remove the …
The way to make such changes to proto definitions is to use …
OK, thanks! I'll investigate to see if there's a way I can safely and effectively deprecate the …
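For general illustration of how protobuf supports this kind of change (this is not the actual vitess .proto edit; the message name, field names, and field numbers below are hypothetical), proto3 lets you mark a field deprecated first, and later remove it while reserving its number and name so they can never be silently reused by a future field:

```proto
syntax = "proto3";

package replicationdata;

message Status {
  // Step 1: discourage new use while keeping wire compatibility.
  bool io_thread_running = 22 [deprecated = true];

  // Step 2 (a release or more later, once upgrade-downgrade paths
  // no longer read the field): delete it and reserve its slot:
  //   reserved 22;
  //   reserved "io_thread_running";

  // Replacement field carrying the richer three-valued IO state.
  int32 io_state = 23;
}
```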
Commits (each signed off by Matt Lord <mattalord@gmail.com>):

…failed
…ationStatus
And make function names nicer (removing double context) and add some helper functions
I apologize, but I'm going to rebase via FF merge from …
Force-pushed e88bc8b to 974633f
Ugh, still happening. You can see an example here, with errors about code that doesn't exist in my branch: https://github.com/vitessio/vitess/runs/5482587430?check_suite_focus=true The issue is that GH Actions merges main from the merge target (…).
Update vtadmin web protobufs Signed-off-by: Matt Lord <mattalord@gmail.com>
Force-pushed 974633f to d4099fe
Force pushing from a fixed local branch to this PR's remote didn't seem to do it either. 😢 Moving this to a new PR: #9853
Description
We considered the `Slave_IO_Running` state of `Connecting` as equivalent to `Running` in the replication status results that we get from MySQL. I'm assuming this was done to avoid flapping on low-traffic systems due to the `-slave_net_timeout` reconnects, and to avoid doing a tablet repair when replication was healthy (with this PR we can distinguish running from healthy states and take differing actions depending on what we want).

After #9308 we properly estimate the replica lag when MySQL is telling us that it does not know, meaning it returns a `NULL` value for the `Seconds_Behind_Master` field. But in some cases (it appears to happen when attempting the first connection to the source) MySQL will report a `Seconds_Behind_Master` value of `0` (meaning fully caught up and no lag) even when it is not connected to its replication source and has failed to reconnect. That is, this is not a simple reconnect for any reason (e.g. `-slave_net_timeout`), but a reconnect with one or more failures/errors. This PR handles this latter case within Vitess by continuing to treat `Connecting` as equivalent to `Running` via `ReplicationStatus.Healthy()`, to prevent the noted flapping and errant/unnecessary tablet repairs, unless we had an IO error the last time we tried to reconnect to the replication source.

Related Issue(s)
Checklist