Refuse recursive read locks #2006
Conversation
Codecov Report: Base 87.25% // Head: 87.22% // Decreases project coverage by -0.04%.
@@ Coverage Diff @@
## main #2006 +/- ##
==========================================
- Coverage 87.25% 87.22% -0.04%
==========================================
Files 100 100
Lines 44480 44505 +25
==========================================
+ Hits 38810 38818 +8
- Misses 5670 5687 +17
☔ View full report at Codecov.
Force-pushed 3adef72 to 46428a7
Looks good, but I'll make sure to review it after it's rebased, once the conflicting things land :)
lightning/src/sync/debug_sync.rs (Outdated)
self.inner.read().map(|guard| RwLockReadGuard { lock: self, guard, first_lock }).map_err(|_| ())
// Note that while we could be taking a recursive read lock here, Rust's `RwLock` may
// deadlock trying to take a second read lock if another thread is waiting on the write
// lock. Its platform dependent (but our in-tree `FairRwLock` guarantees this behavior).
very tiny nit: s/Its/It's/
Needs a rebase
Converting to draft until at least #1897 because it'll need to be adapted for that, I think.
Force-pushed 46428a7 to a1bb7ca
Rebased, fixed some issues, and added another check which turned up one more bug (see last commit).
Force-pushed a1bb7ca to 267f0a6
Oops, forgot to update the no-std lock file.
Force-pushed 267f0a6 to ba66a76
CI is still failing with some lifetime errors.
Force-pushed ba66a76 to 1877714
LGTM after squash
Force-pushed 1877714 to 18b1e86
Squashed without changes, but pushed one additional commit to `export RUST_BACKTRACE=1` in CI, because Windows is failing and I can't really debug it locally otherwise.
Force-pushed d459767 to a91f5e6
Fixed a silent rebase conflict, see the last commit.
Hmm, the one-mutex-per-line restriction can be a little surprising to contributors. I wonder if there's any easy way to enforce this/notify an author.
I also see `test transaction sync clients` on 1.59 rustc, but now we get:
Yea, looks like
Force-pushed a91f5e6 to b8e36a2
Good point! I pushed a commit to assert this on non-Windows platforms (it will still generate false-positives on Windows, but at least on other platforms you'll get a helpful assertion failure error message).
Found one more case, and also modified the log to include the line number:
diff --git a/lightning/src/chain/channelmonitor.rs b/lightning/src/chain/channelmonitor.rs
index 44390496..64a458d2 100644
--- a/lightning/src/chain/channelmonitor.rs
+++ b/lightning/src/chain/channelmonitor.rs
@@ -4074,7 +4074,10 @@ mod tests {
fn test_prune_preimages() {
let secp_ctx = Secp256k1::new();
let logger = Arc::new(TestLogger::new());
- let broadcaster = Arc::new(TestBroadcaster{txn_broadcasted: Mutex::new(Vec::new()), blocks: Arc::new(Mutex::new(Vec::new()))});
+ let broadcaster = Arc::new(TestBroadcaster{
+ txn_broadcasted: Mutex::new(Vec::new()),
+ blocks: Arc::new(Mutex::new(Vec::new()))
+ });
let fee_estimator = TestFeeEstimator { sat_per_kw: Mutex::new(253) };
let dummy_key = PublicKey::from_secret_key(&secp_ctx, &SecretKey::from_slice(&[42; 32]).unwrap());
diff --git a/lightning/src/sync/debug_sync.rs b/lightning/src/sync/debug_sync.rs
index 76b139f7..76a7bb41 100644
--- a/lightning/src/sync/debug_sync.rs
+++ b/lightning/src/sync/debug_sync.rs
@@ -110,11 +110,11 @@ impl LockMetadata {
let lock_constr_location = get_construction_location(&res._lock_construction_bt);
LOCKS_INIT.call_once(|| { unsafe { LOCKS = Some(StdMutex::new(HashMap::new())); } });
let mut locks = unsafe { LOCKS.as_ref() }.unwrap().lock().unwrap();
- match locks.entry(lock_constr_location.0) {
+ match locks.entry(lock_constr_location.0.clone()) {
hash_map::Entry::Occupied(e) => {
assert_eq!(lock_constr_location.1,
get_construction_location(&e.get()._lock_construction_bt).1,
- "Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives.");
+ "Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives: {}", lock_constr_location.0);
return Arc::clone(e.get())
},
hash_map::Entry::Vacant(e) => { e.insert(Arc::clone(&res)); },
Thanks.
Force-pushed b8e36a2 to 3ac12e6
Basically LGTM, feel free to squash everything on your next push
hash_map::Entry::Occupied(e) => {
assert_eq!(lock_constr_location.1,
get_construction_location(&e.get()._lock_construction_bt).1,
"Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives.");
While we do have to clone it, I still think it's useful to include the actual line where this happens, even if you can simply go through all of your changes to find it.
The backtrace in the panic message should have it?
We previously avoided holding the `total_consistency_lock` while doing crypto operations to build onions. However, now that we've abstracted out the outbound payment logic into a utility module, ensuring the state is consistent at all times is abstracted away from code authors and reviewers, making it likely to break. Further, because we now call `send_payment_along_path` both with, and without, the `total_consistency_lock`, and because recursive read locks may deadlock, it would now be quite difficult to figure out which paths through `outbound_payment` need the lock and which don't. While it may slow writes somewhat, it's not really worth trying to untangle this; instead we just hold the `total_consistency_lock` before going into `outbound_payment` functions.
Our existing lockorder tests assume that a read lock on a thread that is already holding the same read lock is totally fine. This isn't at all true. The `std` `RwLock` behavior is platform-dependent - on most platforms readers can starve writers, as readers will never block for a pending writer. However, on platforms where this is not the case, one thread trying to take a write lock may deadlock with another thread that both already has, and is attempting to take again, a read lock. Worse, our in-tree `FairRwLock` exhibits this behavior explicitly on all platforms to avoid the starvation issue. Sadly, a user ended up hitting this deadlock in production in the form of a call to `get_and_clear_pending_msg_events`, which holds the `ChannelManager::total_consistency_lock` before calling `process_pending_monitor_events` and eventually `channel_monitor_updated`, which tries to take the same read lock again. Luckily, the fix is trivial: simply remove the redundant read lock in `channel_monitor_updated`. Fixes lightningdevkit#2000
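The hazard described in this commit message can be caught at test time by tracking, per thread, which locks are already read-held and refusing a second read of the same lock. This is a minimal sketch in the spirit of the PR's `debug_sync` changes, not the actual LDK code; all names here are illustrative:

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::ops::Deref;
use std::sync::{RwLock, RwLockReadGuard};

thread_local! {
    // Addresses of locks this thread currently holds a read guard on.
    static READ_LOCKS_HELD: RefCell<HashSet<usize>> = RefCell::new(HashSet::new());
}

pub struct DebugRwLock<T> { inner: RwLock<T> }

pub struct DebugReadGuard<'a, T> {
    key: usize,
    guard: RwLockReadGuard<'a, T>,
}

impl<T> DebugRwLock<T> {
    pub fn new(v: T) -> Self { DebugRwLock { inner: RwLock::new(v) } }

    pub fn read(&self) -> DebugReadGuard<'_, T> {
        let key = self as *const Self as usize;
        READ_LOCKS_HELD.with(|held| {
            // A second read() on the same lock from the same thread can
            // deadlock against a pending writer on a fair RwLock, so
            // refuse it loudly in tests instead of hoping for the best.
            assert!(held.borrow_mut().insert(key),
                "Attempted recursive read lock; this may deadlock");
        });
        DebugReadGuard { key, guard: self.inner.read().unwrap() }
    }
}

impl<T> Deref for DebugReadGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T { &self.guard }
}

impl<T> Drop for DebugReadGuard<'_, T> {
    fn drop(&mut self) {
        // Releasing the guard makes a later read() on this lock legal again.
        READ_LOCKS_HELD.with(|held| { held.borrow_mut().remove(&self.key); });
    }
}
```

Sequential reads (take, drop, take again) pass; only a genuinely recursive read on the same thread trips the assertion.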
When handling a `ChannelMonitor` update via the new `handle_new_monitor_update` macro, we always call the macro with the `per_peer_state` read lock held and have the macro drop the per-peer state lock. Then, when handling the resulting updates, we may take the `per_peer_state` read lock again in another function. In a coming commit, recursive read locks will be disallowed, so we have to drop the `per_peer_state` read lock before calling additional functions in `handle_new_monitor_update`, which we do here.
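The pattern this commit adopts, dropping the read guard before calling into helpers that take the same lock themselves, can be sketched as follows. The types and names below are illustrative stand-ins, not the actual `ChannelManager` API:

```rust
use std::sync::RwLock;

struct PeerState { pending_msgs: Vec<&'static str> }

struct Manager { per_peer_state: RwLock<PeerState> }

impl Manager {
    // Helper that acquires the read lock itself, mirroring the functions
    // invoked after the monitor-update handling returns.
    fn pending_msg_count(&self) -> usize {
        self.per_peer_state.read().unwrap().pending_msgs.len()
    }

    fn handle_update(&self) -> usize {
        let needs_followup = {
            let peer_state = self.per_peer_state.read().unwrap();
            !peer_state.pending_msgs.is_empty()
            // The read guard is dropped at the end of this block, BEFORE we
            // call a helper that re-takes the lock. Holding it across the
            // call would be a recursive read, which can deadlock on a fair
            // RwLock and is disallowed by the new lockorder checks.
        };
        if needs_followup { self.pending_msg_count() } else { 0 }
    }
}
```

Scoping the guard in its own block makes the drop point explicit to reviewers, which is the whole point of the restructuring.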
Our existing lockorder tests assume that a read lock on a thread that is already holding the same read lock is totally fine. This isn't at all true. The `std` `RwLock` behavior is platform-dependent - on most platforms readers can starve writers as readers will never block for a pending writer. However, on platforms where this is not the case, one thread trying to take a write lock may deadlock with another thread that both already has, and is attempting to take again, a read lock. Worse, our in-tree `FairRwLock` exhibits this behavior explicitly on all platforms to avoid the starvation issue. Thus, we shouldn't have any special handling for allowing recursive read locks, so we simply remove it here.
Taking two instances of the same mutex may be totally fine, but it requires a total lockorder that we cannot (trivially) check. Thus, it's generally unsafe to do if we can avoid it. To discourage doing this, here we default to panicking on such locks in our lockorder tests, with a separate lock function added that is clearly labeled "unsafe" to allow doing so when we can guarantee a total lockorder. This requires adapting a number of sites to the new API, including fixing a bug this turned up in `ChannelMonitor`'s `PartialEq` where no lockorder was guaranteed.
as this test often fails on Windows, which is hard to debug locally for most contributors.
Our lockdep logic (on Windows) identifies a mutex based on which line it was constructed on. Thus, if we have two mutexes constructed on the same line it will generate false positives.
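The false-positive mode can be illustrated with a toy construction-site key: with column numbers, two locks built on the same line are distinguishable; without them (as in Windows backtraces), they collide. The macro and key format below are illustrative, not the actual `debug_sync` internals:

```rust
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ConstructionSite {
    pub file: &'static str,
    pub line: u32,
    pub col: u32,
}

// Record where a lock is constructed; file!/line!/column! expand to the
// location of the macro invocation.
macro_rules! construction_site {
    () => {
        ConstructionSite { file: file!(), line: line!(), col: column!() }
    };
}

// Build the identity key used for lockorder tracking. When backtraces carry
// no column information (as on Windows), the column is forced to zero, so
// two locks constructed on the same source line become indistinguishable -
// exactly the false positive the assertion in this PR guards against.
pub fn site_key(site: &ConstructionSite, have_columns: bool) -> (String, u32) {
    let col = if have_columns { site.col } else { 0 };
    (format!("{}:{}", site.file, site.line), col)
}
```

Two sites created on one source line collide under the column-less key but remain distinct when columns are available, which is why the fix is simply to never construct two mutexes on one line.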
Force-pushed 3ac12e6 to 065dc6e
Fixed the comments and squashed:
$ git diff-tree -U1 3ac12e65 065dc6e6
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 6f654645c..1045e77ad 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -830,3 +830,3 @@ where
/// Notifier the lock contains sends out a notification when the lock is released.
- pub(crate) total_consistency_lock: RwLock<()>,
+ total_consistency_lock: RwLock<()>,
@@ -7878,6 +7878,3 @@ mod tests {
let session_privs = nodes[0].node.test_add_new_pending_payment(our_payment_hash, Some(payment_secret), payment_id, &mpp_route).unwrap();
- {
- let _lck = nodes[0].node.total_consistency_lock.read().unwrap();
- nodes[0].node.send_payment_along_path(&mpp_route.paths[0], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[0]).unwrap();
- }
+ nodes[0].node.test_send_payment_along_path(&mpp_route.paths[0], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[0]).unwrap();
check_added_monitors!(nodes[0], 1);
@@ -7911,6 +7908,3 @@ mod tests {
// Send the second half of the original MPP payment.
- {
- let _lck = nodes[0].node.total_consistency_lock.read().unwrap();
- nodes[0].node.send_payment_along_path(&mpp_route.paths[1], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[1]).unwrap();
- }
+ nodes[0].node.test_send_payment_along_path(&mpp_route.paths[1], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[1]).unwrap();
check_added_monitors!(nodes[0], 1);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
May want to hold this one a little bit, because some pending PRs may break/conflict, and it's easier to fix here than there.