Backport: Only start SQL thread temporarily to WaitForPosition if needed #10123

Merged
mattlord merged 2 commits into vitessio:release-13.0 from planetscale:backport10104_v13 on Apr 24, 2022

Conversation

@mattlord mattlord commented Apr 21, 2022

⚠️ This does not warrant another 13.0 patch release on its own; it should simply be included in the next patch release, whenever that is made ⚠️

Description

After #9512 we always attempted to start the replication SQL_Thread(s) when waiting for a given position. The problem is that if the SQL_Thread is running but the IO_Thread is not, tablet repair does not try to start replication on the replica tablet. As a result, in certain states, such as when initializing a shard, replication may end up unhealthy and never be repaired.

This changes the behavior so that:

  1. We only attempt to start the SQL_Thread(s) if they are not already running
  2. If we explicitly start the SQL_Thread(s), we also explicitly reset them to their previous state (stopped) as we exit the call

The caller should be (or have) a TabletManager, which holds a mutex, so the replication manager calls should be serialized. And because we reset the replication state after mutating it, everything should work as it did before #9512, with one exception: when waiting, we ensure that the replica at least has the possibility of catching up.
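
The behavior described above can be illustrated with a minimal, self-contained Go sketch. It is not the Vitess code: the `replica` type, its fields, and the `startSQLThread`, `stopSQLThread`, and `waitForPosition` methods are hypothetical names invented for illustration, and positions are modeled as plain integers rather than GTID sets. It only shows the pattern of starting the SQL thread when it is not already running and restoring the stopped state on exit, with a mutex standing in for the serialization the caller is expected to provide.

```go
package main

import (
	"fmt"
	"sync"
)

// replica is a hypothetical stand-in for a MySQL replica (not a Vitess type).
type replica struct {
	mu         sync.Mutex // stands in for the caller's serializing mutex
	sqlRunning bool       // whether the SQL thread is currently running
	relayed    int        // last position already fetched into the relay log
	applied    int        // last position applied by the SQL thread
}

func (r *replica) startSQLThread() error { r.sqlRunning = true; return nil }
func (r *replica) stopSQLThread() error  { r.sqlRunning = false; return nil }

// waitForPosition starts the SQL thread only if it is not already running and,
// if it did start it, stops it again before returning.
func (r *replica) waitForPosition(target int) (err error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if !r.sqlRunning {
		if startErr := r.startSQLThread(); startErr != nil {
			return startErr
		}
		// We changed the replication state, so reset it on the way out.
		defer func() {
			if stopErr := r.stopSQLThread(); stopErr != nil && err == nil {
				err = stopErr
			}
		}()
	}

	// Simulated wait: a running SQL thread applies everything in the relay log.
	if r.relayed > r.applied {
		r.applied = r.relayed
	}
	if r.applied < target {
		return fmt.Errorf("position %d not reached (applied %d)", target, r.applied)
	}
	return nil
}

func main() {
	r := &replica{relayed: 10, applied: 4} // SQL thread initially stopped
	err := r.waitForPosition(10)
	fmt.Println(err, r.applied, r.sqlRunning) // prints "<nil> 10 false"
}
```

In this sketch a replica whose SQL thread was already running is left running, so the call never changes state that it did not set itself.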

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests are not required
  • Documentation is not required

…sio#10104)

Signed-off-by: Matt Lord <mattalord@gmail.com>

@GuptaManan100 GuptaManan100 left a comment

Everything else looks good to me

As release-13.0 does not have this:
  vitessio#9853

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord merged commit d685b18 into vitessio:release-13.0 Apr 24, 2022
@mattlord mattlord deleted the backport10104_v13 branch April 24, 2022 00:25