Refuse recursive read locks #2006
Conversation
Codecov Report: Base 87.25% // Head: 87.22% // Decreases project coverage by -0.04%.
@@ Coverage Diff @@
## main #2006 +/- ##
==========================================
- Coverage 87.25% 87.22% -0.04%
==========================================
Files 100 100
Lines 44480 44505 +25
==========================================
+ Hits 38810 38818 +8
- Misses 5670 5687 +17
☔ View full report at Codecov.
Force-pushed 3adef72 to 46428a7
Looks good, but I'll make sure to review it after it's rebased, once the conflicting things land :)
lightning/src/sync/debug_sync.rs (Outdated)
self.inner.read().map(|guard| RwLockReadGuard { lock: self, guard, first_lock }).map_err(|_| ())
// Note that while we could be taking a recursive read lock here, Rust's `RwLock` may
// deadlock trying to take a second read lock if another thread is waiting on the write
// lock. Its platform dependent (but our in-tree `FairRwLock` guarantees this behavior).
very tiny nit: s/Its/It's/
Needs a rebase
Converting to draft until at least #1897 because it'll need to be adapted for that, I think.
Force-pushed 46428a7 to a1bb7ca
Rebased, fixed some issues, and added another check which turned up one more bug (see last commit).
Force-pushed a1bb7ca to 267f0a6
Oops, forgot to update the no-std lock file.
Force-pushed 267f0a6 to ba66a76
CI is still failing with some lifetime errors.
Force-pushed ba66a76 to 1877714
LGTM after squash
Force-pushed 1877714 to 18b1e86
Squashed without changes, but pushed one additional commit to `export RUST_BACKTRACE=1` in CI, because Windows is failing and I can't really debug it locally otherwise.
Force-pushed d459767 to a91f5e6
Fixed a silent rebase conflict, see the last commit.
Hmm, the one-mutex-per-line restriction can be a little surprising to contributors. I wonder if there's any easy way to enforce this/notify an author.
I also see `test transaction sync clients` on 1.59 rustc, but now we get:
Yea, looks like
Force-pushed a91f5e6 to b8e36a2
Good point! I pushed a commit to assert this on non-Windows platforms (it will still generate false-positives on Windows, but at least on other platforms you'll get a helpful assertion failure error message).
Found one more case, and also modified the log to include the line number:
diff --git a/lightning/src/chain/channelmonitor.rs b/lightning/src/chain/channelmonitor.rs
index 44390496..64a458d2 100644
--- a/lightning/src/chain/channelmonitor.rs
+++ b/lightning/src/chain/channelmonitor.rs
@@ -4074,7 +4074,10 @@ mod tests {
fn test_prune_preimages() {
let secp_ctx = Secp256k1::new();
let logger = Arc::new(TestLogger::new());
- let broadcaster = Arc::new(TestBroadcaster{txn_broadcasted: Mutex::new(Vec::new()), blocks: Arc::new(Mutex::new(Vec::new()))});
+ let broadcaster = Arc::new(TestBroadcaster{
+ txn_broadcasted: Mutex::new(Vec::new()),
+ blocks: Arc::new(Mutex::new(Vec::new()))
+ });
let fee_estimator = TestFeeEstimator { sat_per_kw: Mutex::new(253) };
let dummy_key = PublicKey::from_secret_key(&secp_ctx, &SecretKey::from_slice(&[42; 32]).unwrap());
diff --git a/lightning/src/sync/debug_sync.rs b/lightning/src/sync/debug_sync.rs
index 76b139f7..76a7bb41 100644
--- a/lightning/src/sync/debug_sync.rs
+++ b/lightning/src/sync/debug_sync.rs
@@ -110,11 +110,11 @@ impl LockMetadata {
let lock_constr_location = get_construction_location(&res._lock_construction_bt);
LOCKS_INIT.call_once(|| { unsafe { LOCKS = Some(StdMutex::new(HashMap::new())); } });
let mut locks = unsafe { LOCKS.as_ref() }.unwrap().lock().unwrap();
- match locks.entry(lock_constr_location.0) {
+ match locks.entry(lock_constr_location.0.clone()) {
hash_map::Entry::Occupied(e) => {
assert_eq!(lock_constr_location.1,
get_construction_location(&e.get()._lock_construction_bt).1,
- "Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives.");
+ "Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives: {}", lock_constr_location.0);
return Arc::clone(e.get())
},
hash_map::Entry::Vacant(e) => { e.insert(Arc::clone(&res)); },
Thanks.
Force-pushed b8e36a2 to 3ac12e6
Basically LGTM, feel free to squash everything on your next push
hash_map::Entry::Occupied(e) => {
assert_eq!(lock_constr_location.1,
get_construction_location(&e.get()._lock_construction_bt).1,
"Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives.");
While we do have to clone it, I still think it's useful to include the actual line where this happens, even if you can simply go through all of your changes to find it.
The backtrace in the panic message should have it?
We previously avoided holding the `total_consistency_lock` while doing crypto operations to build onions. However, now that we've abstracted out the outbound payment logic into a utility module, ensuring the state is consistent at all times is abstracted away from code authors and reviewers, making it likely to break. Further, because we now call `send_payment_along_path` both with, and without, the `total_consistency_lock`, and because recursive read locks may deadlock, it would now be quite difficult to figure out which paths through `outbound_payment` need the lock and which don't. While it may slow writes somewhat, it's not really worth trying to untangle this; instead we just hold the `total_consistency_lock` before going into `outbound_payment` functions.
Our existing lockorder tests assume that a read lock on a thread that is already holding the same read lock is totally fine. This isn't at all true. The `std` `RwLock` behavior is platform-dependent - on most platforms readers can starve writers, as readers will never block for a pending writer. However, on platforms where this is not the case, one thread trying to take a write lock may deadlock with another thread that both already has, and is attempting to take again, a read lock. Worse, our in-tree `FairRwLock` exhibits this behavior explicitly on all platforms to avoid the starvation issue. Sadly, a user ended up hitting this deadlock in production in the form of a call to `get_and_clear_pending_msg_events`, which holds the `ChannelManager::total_consistency_lock` before calling `process_pending_monitor_events` and eventually `channel_monitor_updated`, which tries to take the same read lock again. Luckily, the fix is trivial: simply remove the redundant read lock in `channel_monitor_updated`. Fixes lightningdevkit#2000
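The hazard described in this commit message can be caught at test time by tracking, per thread, which locks are already read-held and refusing a second read of the same lock. This is a minimal sketch in the spirit of the PR's `debug_sync` changes, not the actual LDK code; all names here are illustrative:

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::ops::Deref;
use std::sync::{RwLock, RwLockReadGuard};

thread_local! {
    // Addresses of locks this thread currently holds a read guard on.
    static READ_LOCKS_HELD: RefCell<HashSet<usize>> = RefCell::new(HashSet::new());
}

pub struct DebugRwLock<T> { inner: RwLock<T> }

pub struct DebugReadGuard<'a, T> {
    key: usize,
    guard: RwLockReadGuard<'a, T>,
}

impl<T> DebugRwLock<T> {
    pub fn new(v: T) -> Self { DebugRwLock { inner: RwLock::new(v) } }

    pub fn read(&self) -> DebugReadGuard<'_, T> {
        let key = self as *const Self as usize;
        READ_LOCKS_HELD.with(|held| {
            // A second read() on the same lock from the same thread can
            // deadlock against a pending writer on a fair RwLock, so
            // refuse it loudly in tests instead of hoping for the best.
            assert!(held.borrow_mut().insert(key),
                "Attempted recursive read lock; this may deadlock");
        });
        DebugReadGuard { key, guard: self.inner.read().unwrap() }
    }
}

impl<T> Deref for DebugReadGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T { &self.guard }
}

impl<T> Drop for DebugReadGuard<'_, T> {
    fn drop(&mut self) {
        // Releasing the guard makes a later read() on this lock legal again.
        READ_LOCKS_HELD.with(|held| { held.borrow_mut().remove(&self.key); });
    }
}
```

Sequential reads (take, drop, take again) pass; only a genuinely recursive read on the same thread trips the assertion.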
When handling a `ChannelMonitor` update via the new `handle_new_monitor_update` macro, we always call the macro with the `per_peer_state` read lock held and have the macro drop the per-peer state lock. Then, when handling the resulting updates, we may take the `per_peer_state` read lock again in another function. In a coming commit, recursive read locks will be disallowed, so we have to drop the `per_peer_state` read lock before calling additional functions in `handle_new_monitor_update`, which we do here.
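The pattern this commit adopts, dropping the read guard before calling into helpers that take the same lock themselves, can be sketched as follows. The types and names below are illustrative stand-ins, not the actual `ChannelManager` API:

```rust
use std::sync::RwLock;

struct PeerState { pending_msgs: Vec<&'static str> }

struct Manager { per_peer_state: RwLock<PeerState> }

impl Manager {
    // Helper that acquires the read lock itself, mirroring the functions
    // invoked after the monitor-update handling returns.
    fn pending_msg_count(&self) -> usize {
        self.per_peer_state.read().unwrap().pending_msgs.len()
    }

    fn handle_update(&self) -> usize {
        let needs_followup = {
            let peer_state = self.per_peer_state.read().unwrap();
            !peer_state.pending_msgs.is_empty()
            // The read guard is dropped at the end of this block, BEFORE we
            // call a helper that re-takes the lock. Holding it across the
            // call would be a recursive read, which can deadlock on a fair
            // RwLock and is disallowed by the new lockorder checks.
        };
        if needs_followup { self.pending_msg_count() } else { 0 }
    }
}
```

Scoping the guard in its own block makes the drop point explicit to reviewers, which is the whole point of the restructuring.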
Our existing lockorder tests assume that a read lock on a thread that is already holding the same read lock is totally fine. This isn't at all true. The `std` `RwLock` behavior is platform-dependent - on most platforms readers can starve writers as readers will never block for a pending writer. However, on platforms where this is not the case, one thread trying to take a write lock may deadlock with another thread that both already has, and is attempting to take again, a read lock. Worse, our in-tree `FairRwLock` exhibits this behavior explicitly on all platforms to avoid the starvation issue. Thus, we shouldn't have any special handling for allowing recursive read locks, so we simply remove it here.
Taking two instances of the same mutex may be totally fine, but it requires a total lockorder that we cannot (trivially) check. Thus, it's generally unsafe to do if we can avoid it. To discourage doing this, here we default to panicking on such locks in our lockorder tests, with a separate lock function added that is clearly labeled "unsafe" to allow doing so when we can guarantee a total lockorder. This requires adapting a number of sites to the new API, including fixing a bug this turned up in `ChannelMonitor`'s `PartialEq` where no lockorder was guaranteed.
as this test often fails on Windows, which is hard to debug locally for most contributors.
Our lockdep logic (on Windows) identifies a mutex based on which line it was constructed on. Thus, if we have two mutexes constructed on the same line it will generate false positives.
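The false-positive mode can be illustrated with a toy construction-site key: with column numbers, two locks built on the same line are distinguishable; without them (as in Windows backtraces), they collide. The macro and key format below are illustrative, not the actual `debug_sync` internals:

```rust
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ConstructionSite {
    pub file: &'static str,
    pub line: u32,
    pub col: u32,
}

// Record where a lock is constructed; file!/line!/column! expand to the
// location of the macro invocation.
macro_rules! construction_site {
    () => {
        ConstructionSite { file: file!(), line: line!(), col: column!() }
    };
}

// Build the identity key used for lockorder tracking. When backtraces carry
// no column information (as on Windows), the column is forced to zero, so
// two locks constructed on the same source line become indistinguishable -
// exactly the false positive the assertion in this PR guards against.
pub fn site_key(site: &ConstructionSite, have_columns: bool) -> (String, u32) {
    let col = if have_columns { site.col } else { 0 };
    (format!("{}:{}", site.file, site.line), col)
}
```

Two sites created on one source line collide under the column-less key but remain distinct when columns are available, which is why the fix is simply to never construct two mutexes on one line.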
Force-pushed 3ac12e6 to 065dc6e
Fixed the comments and squashed:
$ git diff-tree -U1 3ac12e65 065dc6e6
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 6f654645c..1045e77ad 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -830,3 +830,3 @@ where
/// Notifier the lock contains sends out a notification when the lock is released.
- pub(crate) total_consistency_lock: RwLock<()>,
+ total_consistency_lock: RwLock<()>,
@@ -7878,6 +7878,3 @@ mod tests {
let session_privs = nodes[0].node.test_add_new_pending_payment(our_payment_hash, Some(payment_secret), payment_id, &mpp_route).unwrap();
- {
- let _lck = nodes[0].node.total_consistency_lock.read().unwrap();
- nodes[0].node.send_payment_along_path(&mpp_route.paths[0], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[0]).unwrap();
- }
+ nodes[0].node.test_send_payment_along_path(&mpp_route.paths[0], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[0]).unwrap();
check_added_monitors!(nodes[0], 1);
@@ -7911,6 +7908,3 @@ mod tests {
// Send the second half of the original MPP payment.
- {
- let _lck = nodes[0].node.total_consistency_lock.read().unwrap();
- nodes[0].node.send_payment_along_path(&mpp_route.paths[1], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[1]).unwrap();
- }
+ nodes[0].node.test_send_payment_along_path(&mpp_route.paths[1], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[1]).unwrap();
check_added_monitors!(nodes[0], 1);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
May want to hold this one a little bit, because some pending PRs may break/conflict, and it's easier to fix here than there.