Fix deadlock in InboundLedgers and NetworkOPs #5124
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #5124 +/- ##
=========================================
+ Coverage 77.7% 77.9% +0.2%
=========================================
Files 779 782 +3
Lines 66015 66614 +599
Branches 8156 8140 -16
=========================================
+ Hits 51261 51887 +626
+ Misses 14754 14727 -27
Is it just me, or do these changes not actually change any behavior?
In InboundLedgers::acquireAsync(), we have this sequence before:
- 142: unlock().
- 143: call acquire(). If it does not throw, continue to line 157. If it does throw, one of the exception handlers is entered (no exception can skip both handlers), neither throws (they just log warnings), and execution continues to line 157. Either way, we end up at line 157.
- 157: lock().
Now, I prefer a RAII type like scope_unlock, but I think the only difference in behavior it produces here is that the call to lock() is moved before the exception handler is entered (if one is entered). But that is not a material difference. Is that right?
Very similar story for NetworkOPsImp::recvValidation.
Review feedback Co-authored-by: John Freeman <[email protected]>
The change in behaviour is if we have an exception before
Yes, same applies - there's an The RAII
This may be redundant with the CanProcess class in #5126. I'm biased, but I prefer my solution because it hides all the locking in a single class. I'm open to being convinced otherwise, though.
You are right that your PR solves this problem. However, given the number of other issues that your PR solves, and the associated code churn, it might be a little while until it is tested and approved. In the meantime, if we get this merged and, some time after, your PR merged as well, we will eventually end up with
LGTM
Edit: I left one small suggested change to a comment, but whether you take it or not, this is good to go.
} // mut gets locked here.
} // mut gets unlocked here
@endcode
Just to drive the point home:
Current:
    } // mut gets locked here.
    } // mut gets unlocked here
    @endcode
Suggested:
    } // mut gets locked here.
    ... do some more stuff with it locked ...
    } // mut gets unlocked here
    @endcode
* 2.2.2 changed functions acquireAsync and NetworkOPsImp::recvValidation to add an item to a collection under lock, unlock, do some work, then lock again to remove the item. It will deadlock if an exception is thrown while adding the item, before unlocking.
* Replace ScopedUnlock with scope_unlock.
High Level Overview of Change
Fixes a deadlock bug in the 2.2.2 release.
Context of Change
Functions acquireAsync and NetworkOPsImp::recvValidation contain lock.lock(), which will deadlock if an exception is thrown while the (non-recursive) mutex is owned by the calling thread, that is, before lock.unlock().
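The failure mode described above can be reduced to a small sketch. Everything here is hypothetical: the function name, the `pending` set, and the `throwBeforeUnlock` switch are illustrative stand-ins, not code from the PR. The sketch also assumes the lock is a `std::unique_lock<std::mutex>`, for which the standard library reports a second `lock()` on an already-owned lock as `std::system_error` (`resource_deadlock_would_occur`) rather than blocking; the lock type in the real code may behave differently.

```cpp
#include <mutex>
#include <set>
#include <stdexcept>
#include <system_error>

// Hypothetical reduction of the pre-fix pattern: add an item under lock,
// unlock, do some work, then lock again to remove the item.
// Returns true if the trailing lock() ran while the lock was still owned.
bool manual_relock_fails(bool throwBeforeUnlock)
{
    std::mutex mut;
    std::set<int> pending;
    std::unique_lock<std::mutex> lock(mut);
    try
    {
        pending.insert(1);
        if (throwBeforeUnlock)
            throw std::runtime_error("failed before unlock");
        lock.unlock();
        // ... work done without holding the lock ...
    }
    catch (std::exception const&)
    {
        // The handler only logs; execution continues below.
    }
    try
    {
        // Still owned here if the throw happened before unlock().
        lock.lock();
    }
    catch (std::system_error const& e)
    {
        return e.code() == std::errc::resource_deadlock_would_occur;
    }
    pending.erase(1);
    return false;
}
```

Calling `manual_relock_fails(true)` exercises the path where the exception arrives before `unlock()`, and `manual_relock_fails(false)` exercises the ordinary path where the manual `unlock()`/`lock()` pair is balanced.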
In the 2.2.2 release we switched some operations from synchronous to asynchronous, guarded by mutexes, but failed to account for exception safety of the critical section, which may result in a deadlock if an exception is thrown before lock.unlock() inside the try block. It is extremely unlikely that a deadlock will actually occur in practice, since the possible exceptions in this section of code are scarce.
The solution adopted in this PR is to move ScopedUnlock from LedgerMaster.cpp to basics/scope.h (adjusting casing to match other utilities in this file, to scope_unlock) and to use this type to create a RAII unlock, removing the problematic lock.lock().
An alternative solution might be to abstract away the "pending operation" with the corresponding mutex, and then use it where appropriate, but I wanted this change to generalise an existing lower-level utility (i.e. scoped unlock) rather than write a new one. The "pending operation" approach could be further generalised into structured asynchronous operations, which would place it in a much larger PR.
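A minimal sketch of the RAII unlock idea follows. `scope_unlock_sketch` is a hypothetical stand-in written for illustration; the actual scope_unlock in basics/scope.h may differ in its interface and details. The sketch assumes a `std::unique_lock` over a non-recursive mutex and shows that the lock is reacquired during stack unwinding, so no explicit lock.lock() is needed after the handlers.

```cpp
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for scope_unlock: unlocks on construction and
// relocks on destruction, including when the scope exits by exception.
template <class Mutex>
class scope_unlock_sketch
{
    std::unique_lock<Mutex>* plock_;

public:
    explicit scope_unlock_sketch(std::unique_lock<Mutex>& lock)
        : plock_(&lock)
    {
        plock_->unlock();
    }

    scope_unlock_sketch(scope_unlock_sketch const&) = delete;
    scope_unlock_sketch& operator=(scope_unlock_sketch const&) = delete;

    ~scope_unlock_sketch()
    {
        // Relock on every exit path from the unlocked scope.
        plock_->lock();
    }
};

// Returns true if the lock is owned again after the unlocked region
// was left by an exception.
bool relocks_after_exception()
{
    std::mutex mut;
    std::unique_lock<std::mutex> lock(mut);
    try
    {
        scope_unlock_sketch<std::mutex> unlock(lock);
        // mut is unlocked here.
        throw std::runtime_error("work failed");
    }  // mut gets locked here, during unwinding.
    catch (std::exception const&)
    {
        // A handler would only log here; the lock is already held again.
    }
    return lock.owns_lock();
}  // mut gets unlocked here.
```

Because the unlock only happens when the guard object is constructed, an exception thrown earlier in the try block simply leaves the lock owned, and there is no later unconditional lock.lock() to trip over.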
Type of Change