Merged
And fix flaky test Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Change actionMutex to a semaphore to implement a tryLock function in tm, and use it in replManager. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
rafael
reviewed
Aug 26, 2020
deepthi
reviewed
Aug 26, 2020
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
A deadlock was found during a PRS (PlannedReparentShard). The root cause was an earlier fix in which we changed the replmanager to take the action lock; without the lock, it could race and conflict with other actions. But this led to a deadlock, because `PromoteReplica` also waits for the replmanager to finish its fix.
We could have spot-fixed this for the specific use case. But in the interest of preventing other corner cases, the better fix was to change the replmanager to not wait if it couldn't obtain the lock.
However, the implementation of `lock` with a context timeout was flawed: it did not actually time out when the context expired. So I implemented a new `AcquireContext` in `sync2.Semaphore` too, which prompted me to fix the flaky tests there.
Using the semaphore allowed me to implement a real `tryLock`, which replManager now uses.
Since this was a race condition, I tested it manually. The test that failed previously now passes.