
Consider repl NOT healthy with IO_Thread connecting and last attempt failed#9789

Closed
mattlord wants to merge 9 commits into vitessio:main from planetscale:repl_io_thread_conn_with_err

Conversation

@mattlord
Member

@mattlord mattlord commented Feb 25, 2022

Description

We have been treating the Slave_IO_Running state of Connecting as equivalent to Running in the replication status results that we get from MySQL. I'm assuming this was done to avoid flapping on low-traffic systems due to -slave_net_timeout reconnects, and to avoid doing a tablet repair when replication was actually healthy (with this PR we can distinguish running from healthy states and take different actions depending on what we want).

After #9308 we properly estimate the replica lag when MySQL tells us that it does not know, meaning it returns a NULL value for the Seconds_Behind_Master field. In some cases, however (it appears to happen when attempting the first connection to the source), MySQL will report a Seconds_Behind_Master value of 0 (meaning fully caught up, no lag) even when it is not connected to its replication source and has failed to reconnect. That is, this is not a simple reconnect for any reason (e.g. -slave_net_timeout), but a reconnect with one or more failures/errors. This PR handles that latter case within Vitess by continuing to treat Connecting as equivalent to Running via ReplicationStatus.Healthy(), to prevent the noted flapping and errant/unnecessary tablet repairs, unless we had an IO error the last time we tried to reconnect to the replication source.

Note: I think this is also what effectively caused the bug seen in #9788, as the new replica tablet was considered healthy when in fact it had not been able to connect to its source (for the very first time).
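The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual Vitess implementation: the field and method names are simplified stand-ins for the real ReplicationStatus type.

```go
package main

import "fmt"

// ReplicationStatus is a hypothetical, simplified stand-in for the Vitess type.
type ReplicationStatus struct {
	IOThreadRunning    bool   // Slave_IO_Running = Yes
	IOThreadConnecting bool   // Slave_IO_Running = Connecting
	SQLThreadRunning   bool   // Slave_SQL_Running = Yes
	LastIOError        string // Last_IO_Error from the replication status output
}

// Healthy treats Connecting as equivalent to Running, to avoid flapping on
// low-traffic systems, UNLESS the last reconnect attempt to the source failed.
func (s ReplicationStatus) Healthy() bool {
	ioHealthy := s.IOThreadRunning || (s.IOThreadConnecting && s.LastIOError == "")
	return ioHealthy && s.SQLThreadRunning
}

func main() {
	reconnecting := ReplicationStatus{IOThreadConnecting: true, SQLThreadRunning: true}
	failed := ReplicationStatus{IOThreadConnecting: true, SQLThreadRunning: true,
		LastIOError: "error reconnecting to source"}
	fmt.Println(reconnecting.Healthy(), failed.Healthy()) // true false
}
```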

Related Issue(s)

Checklist

  • Should this PR be backported? NO
  • Tests are not required (I don't think)
  • Documentation is not required

@shlomi-noach
Contributor

I have a vague memory of having this discussion elsewhere in the code. I also ran into a similar issue in orchestrator. IMO the correct way to handle this is to avoid using a bool for Slave_IO_Running. The problem with keeping the behavior as it was before this PR is clear, but this PR also creates confusion: it will say the IO thread is not running when it could be in the middle of completing a connection. An example of the risk with this approach: you check replication status and get false for IO thread running; you assume replication is stopped; you run some logic that relies on this assumption; but unknown to you, the IO thread finally succeeds in completing a connection and all of a sudden replication is running.

So IMO the state of IO thread should be an enum like stopped, connecting, running.

@mattlord
Member Author

I have a vague memory of having this discussion elsewhere in the code. I also ran into a similar issue in orchestrator. IMO the correct way to handle this is to avoid using a bool for Slave_IO_Running. The problem with keeping the behavior as it was before this PR is clear, but this PR also creates confusion: it will say the IO thread is not running when it could be in the middle of completing a connection. An example of the risk with this approach: you check replication status and get false for IO thread running; you assume replication is stopped; you run some logic that relies on this assumption; but unknown to you, the IO thread finally succeeds in completing a connection and all of a sudden replication is running.

So IMO the state of IO thread should be an enum like stopped, connecting, running.

I like this idea. It's more accurate (the SQL_Thread really does only have a binary state whereas the IO_Thread has 3 states) and it allows the caller to decide whether or not to consider connecting and running equivalent or perform different actions. I'll work on this now.

Thank you!
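The suggested tri-state could be sketched like this. The names here are illustrative; the enum Vitess actually adopted may differ.

```go
package main

import "fmt"

// ReplicationState models the three possible IO-thread states, instead of a
// bool that conflates "connecting" with either "running" or "stopped".
type ReplicationState int

const (
	ReplicationStateStopped ReplicationState = iota
	ReplicationStateConnecting
	ReplicationStateRunning
)

// ReplicationStateFromStatus maps the Slave_IO_Running column value from
// MySQL's replication status output to the enum.
func ReplicationStateFromStatus(val string) ReplicationState {
	switch val {
	case "Yes":
		return ReplicationStateRunning
	case "Connecting":
		return ReplicationStateConnecting
	default: // "No"
		return ReplicationStateStopped
	}
}

func main() {
	// The caller can now decide whether Connecting counts as healthy,
	// rather than having that decision baked into a bool.
	fmt.Println(ReplicationStateFromStatus("Connecting") == ReplicationStateConnecting) // true
}
```

Note that the SQL thread keeps its binary state; only the IO thread needs the third value.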

@mattlord mattlord requested a review from deepthi as a code owner February 28, 2022 22:28
@mattlord mattlord force-pushed the repl_io_thread_conn_with_err branch 3 times, most recently from 0f94c44 to d1300d1 on March 1, 2022 04:31
@mattlord mattlord requested a review from shlomi-noach March 1, 2022 06:34
@mattlord
Member Author

mattlord commented Mar 1, 2022

@shlomi-noach I applied all of your great suggestions and the tests are now passing. If you have time for another quick review pass (not urgent), I'd most appreciate it.

Thank you again!

Contributor

@shlomi-noach shlomi-noach left a comment


🚀

@mattlord
Member Author

mattlord commented Mar 7, 2022

@shlomi-noach I pushed some more commits, if you have a moment.

I realized that I was altering some behavior when I didn't intend to. I also added some helpers and shortened function names when the context was there twice (ReplicationStatus.Replication*).

In the process I noticed and fixed a bug in replication tablet startup (WaitForReplicationStart) too.

Member Author

@mattlord mattlord Mar 7, 2022


Note: this was a bug. It should be using Running() and NOT Healthy() because it's supposed to wait until we know that we can successfully connect to the source (we were NOT doing that before).

@mattlord mattlord changed the title Consider repl NOT running with IO_Thread connecting and last attempt failed Consider repl NOT healthy with IO_Thread connecting and last attempt failed Mar 7, 2022
@mattlord mattlord requested review from GuptaManan100 and removed request for deepthi March 7, 2022 18:03
Contributor

@GuptaManan100 GuptaManan100 left a comment


Everything looks good to me! I just have 1 question. Is there any way to cleanup the old fields in Proto. I know they cannot be removed immediately for upgrade-downgrade considerations, but can they be marked for deprecation or something like that?

Contributor

@shlomi-noach shlomi-noach left a comment


LGTM. I don't have full context on the problem, and am assuming that the existing tests cover the behavior exhibited by this change.

@mattlord
Member Author

mattlord commented Mar 8, 2022

Everything looks good to me! I just have 1 question. Is there any way to cleanup the old fields in Proto. I know they cannot be removed immediately for upgrade-downgrade considerations, but can they be marked for deprecation or something like that?

Good question! I was wondering that too, but I don't know the answer. @deepthi do we have a defined process or an example to follow regarding making backwards compatibility breaking protobuf changes?

It would be nice to e.g. remove the io_thread_running and io_thread_connecting booleans and replace them with an io_state int field.

@deepthi
Collaborator

deepthi commented Mar 9, 2022

Everything looks good to me! I just have 1 question. Is there any way to cleanup the old fields in Proto. I know they cannot be removed immediately for upgrade-downgrade considerations, but can they be marked for deprecation or something like that?

Good question! I was wondering that too, but I don't know the answer. @deepthi do we have a defined process or an example to follow regarding making backwards compatibility breaking protobuf changes?

It would be nice to e.g. remove the io_thread_running and io_thread_connecting booleans and replace them with an io_state int field.

The way to make such changes to proto definitions is to use reserved: you replace the old field with something like reserved 2; and add the new field. I realize that we have a couple of fields with deprecated=true; those slipped through reviews, and we should never have done that.
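As a sketch of what that looks like (the message name and field numbers here are illustrative, not the actual Vitess proto):

```proto
message Status {
  // The old boolean previously lived at field 2; reserving the number
  // (and the name) prevents either from being accidentally reused.
  reserved 2;
  reserved "io_thread_running";

  // Hypothetical replacement: an int-backed state field.
  int32 io_state = 12;
}
```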

@mattlord
Member Author

mattlord commented Mar 9, 2022

Everything looks good to me! I just have 1 question. Is there any way to cleanup the old fields in Proto. I know they cannot be removed immediately for upgrade-downgrade considerations, but can they be marked for deprecation or something like that?

Good question! I was wondering that too, but I don't know the answer. @deepthi do we have a defined process or an example to follow regarding making backwards compatibility breaking protobuf changes?
It would be nice to e.g. remove the io_thread_running and io_thread_connecting booleans and replace them with an io_state int field.

The way to make such changes to proto definitions is to use reserved: you replace the old field with something like reserved 2; and add the new field. I realize that we have a couple of fields with deprecated=true; those slipped through reviews, and we should never have done that.

OK, thanks! I'll investigate whether there's a way I can safely deprecate the io_thread_running and sql_thread_connecting booleans and effectively replace them with io_state and sql_state ints (and get rid of the io_thread_connecting boolean added in this PR). I was assuming that would be forbidden, but perhaps I was wrong. If we can do that, it would keep the proto-to-struct translation clean and simple.

@mattlord mattlord requested review from deepthi and removed request for ajm188, doeg, notfelineit and rohit-nayak-ps March 9, 2022 02:35
mattlord added 8 commits March 9, 2022 10:31
…failed

Signed-off-by: Matt Lord <mattalord@gmail.com>
…ationStatus

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
And make function names nicer (removing double context)

And add some helper functions

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord
Member Author

mattlord commented Mar 9, 2022

I apologize, but I'm going to rebase via FF merge from vitessio/main and force-push as GitHub Actions is somehow checking out an invalid snapshot and I'm getting CI errors for code that doesn't exist in my branch's HEAD. I'll share another comment with the diff covering the changes since the last reviews once the tests are passing again.

@mattlord mattlord force-pushed the repl_io_thread_conn_with_err branch from e88bc8b to 974633f on March 9, 2022 15:34
@mattlord
Member Author

mattlord commented Mar 9, 2022

Ugh, still happening. You can see an example here with errors about code that doesn't exist in my branch: https://github.com/vitessio/vitess/runs/5482587430?check_suite_focus=true

The issue is that GH Actions merges main from the merge target (vitessio/main) into PRs**, and main had some incompatible additions to go/vt/vtctl/reparentutil/emergency_reparenter_test.go in 66d7b9a. Since those changes landed after this PR branch was created, they get merged in the Actions checkout stage and then things break. I addressed this in a local branch and had to force-push to the PR's remote.

**

/usr/bin/git checkout --progress --force refs/remotes/pull/9789/merge
  Note: switching to 'refs/remotes/pull/9789/merge'.
  
HEAD is now at ab2e07e Merge 974633fa50865bc7d39677c059eb8f570c474f49 into 8c92a04efb89b3b7a74a1db7f366ff5707114cc3

Update vtadmin web protobufs

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the repl_io_thread_conn_with_err branch from 974633f to d4099fe on March 9, 2022 16:47
@mattlord
Member Author

mattlord commented Mar 9, 2022

Force-pushing from a fixed local branch to this PR's remote didn't seem to do it either. 😢 Moving this to a new PR: #9853

@mattlord mattlord closed this Mar 9, 2022
@mattlord mattlord deleted the repl_io_thread_conn_with_err branch March 9, 2022 17:04


Development

Successfully merging this pull request may close these issues.

Bug Report: vttablet replica is considered healthy (serving) even when not connected to primary

4 participants