Refresh ephemeral information before cluster operations in VTOrc #10115

GuptaManan100 merged 12 commits into vitessio:main
Conversation
if checkAndSetIfERSInProgress() {
	AuditTopologyRecovery(topologyRecovery, "an ERS is already in progress, not issuing another")
	return false, topologyRecovery, nil
}
defer setERSCompleted()
Is it safe to remove this?
Yes, it is safe to remove now. Let me start with why it was introduced in the first place. I introduced this logic when I initially changed checkAndRecoverDeadPrimary to use EmergencyReparentShard in #8492.
Before this PR, we used to lock the shard as part of the ERS code, i.e. after we had called ERS. Because the cluster failure checks happen on timer ticks and the repair operations run in separate goroutines, we sometimes had situations where two ticks of the failure check spawned two different repair threads, each of which would have tried to run an ERS.
The LockShard functionality is blocking and doesn't return an error if the shard is already locked. This meant that if there were two calls to ERS, one would get the lock and the other would wait for it. After the first finished its execution, we got another run of ERS. So we had two runs of ERS for a single failure, which was incorrect. The solution we decided on then was to add this code: a mutex and a flag tracking whether an ERS was in progress from this same VTOrc instance. This prevented us from spawning another ERS call while one was already running.
This was a temporary workaround until we addressed federation, as we have now. Since we lock the shard much earlier and, after refreshing the information, check again whether the fix is still required, we no longer need this kind of locking. Even if there is a second call to checkAndRecoverDeadPrimary, it will block on acquiring the lock at the beginning; once it acquires it, it reloads the information and then decides whether to run ERS or not. So, if the first call to ERS fixed everything, there is nothing left to do and we exit.
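The lock-then-recheck flow described above can be sketched as follows. All names here (lockShard, refreshTabletInfo, recoverDeadPrimary, the shardState type) are hypothetical stand-ins for illustration, not the actual VTOrc API.

```go
package main

import "fmt"

// shardState is a toy stand-in for the ephemeral state VTOrc tracks.
type shardState struct {
	locked         bool
	primaryHealthy bool
}

func lockShard(s *shardState)   { s.locked = true }
func unlockShard(s *shardState) { s.locked = false }

// refreshTabletInfo re-reads ephemeral state after the lock is held.
// Here it simulates another agent (e.g. a PRS from vtctld) having
// already repaired the shard in the meantime.
func refreshTabletInfo(s *shardState) { s.primaryHealthy = true }

// recoverDeadPrimary locks the shard, refreshes the information, and
// only acts if the failure is still present after the refresh.
func recoverDeadPrimary(s *shardState) bool {
	lockShard(s)
	defer unlockShard(s)
	refreshTabletInfo(s)
	if s.primaryHealthy {
		return false // nothing to do; another agent already fixed it
	}
	// ... this is where EmergencyReparentShard would run ...
	return true
}

func main() {
	s := &shardState{primaryHealthy: false}
	fmt.Println(recoverDeadPrimary(s)) // false: the refresh showed the shard was already repaired
}
```

A second concurrent caller simply blocks on the shard lock, and by the time it acquires it, the refresh shows a healthy primary, so it becomes a no-op instead of a duplicate ERS.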
So, the change in this PR (along with one more to follow, described below) is actually powerful enough to address three concerns:
- Redundant operations from the same VTOrc instance (like we saw above)
- Redundant operations from VTOrc when a cluster operation is run from vtctld (like PRS, which I have now added a test for)
- Federation between different VTOrc instances. With this change, it should be safe to have multiple VTOrc instances without them stepping on each other's toes, because it guarantees that we only run a cluster operation when it is needed.
For the sake of completeness, I will also add that this reload of information has been added to three of the five functions we use for fixing things. The two I have not yet added it to are fixPrimary and fixReplica, which only fix the replication parameters on the primary and replica respectively. I intend to make that change as well, but I didn't want to add any code that didn't have an associated end-to-end test. So I'll make this change in the next PR, which first refactors the way we get the replication analysis and run the operations; that will also enable us to add unit tests for this change.
Description
This PR adds functionality to VTOrc to refresh its ephemeral information before running any cluster operation, since some other agent could have run a cluster operation in the meantime. To ensure we are not making decisions from stale data, after locking the shard we refresh the information for all the tablets. We then verify that the problem is still present before going ahead to fix it.
After this change, running PlannedReparentShard from vtctld will not cause VTOrc to go bonkers and run a reparent operation of its own. It will see that something has changed, but after refreshing its information it will find that everything is okay and do nothing unnecessary. The way the code is structured now, it was very hard to add a unit test, so for now I have added end-to-end tests for this too; this code can be refactored to be much simpler, which will also allow us to add strong unit tests.
Up until now, VTOrc has not had the capability to refresh only some tablets' or shards' information. The current refresh reloads all tablets' information, which is unnecessary, especially while we are holding the lock. This is marked as a TODO in this PR and will be fixed shortly; I'm not adding that change here, to keep the PR easy to review.
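The TODO above is about scoping the refresh to the affected shard instead of every known tablet. A minimal sketch of that filtering, with hypothetical types and helper names (tablet, refreshShardTablets) that are not the real VTOrc API:

```go
package main

import "fmt"

// tablet is a toy stand-in for VTOrc's view of a tablet record.
type tablet struct {
	keyspace, shard, alias string
}

// refreshShardTablets narrows the refresh to the tablets of the locked
// shard and returns the aliases that would actually be re-read,
// instead of reloading every tablet VTOrc knows about.
func refreshShardTablets(all []tablet, keyspace, shard string) []string {
	var refreshed []string
	for _, t := range all {
		if t.keyspace == keyspace && t.shard == shard {
			// ... re-read this tablet's record from the topo server ...
			refreshed = append(refreshed, t.alias)
		}
	}
	return refreshed
}

func main() {
	all := []tablet{
		{"ks", "-80", "zone1-100"},
		{"ks", "80-", "zone1-200"},
		{"ks", "-80", "zone1-101"},
	}
	fmt.Println(refreshShardTablets(all, "ks", "-80")) // [zone1-100 zone1-101]
}
```

Under a held shard lock, only the tablets of that shard can change hands, so restricting the refresh this way shortens the critical section without losing correctness.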
Related Issue(s)
Checklist
Deployment Notes