
Refuse recursive read locks #2006

Merged

Conversation

TheBlueMatt
Collaborator

Our existing lockorder tests assume that a read lock on a thread
that is already holding the same read lock is totally fine. This
isn't at all true. The std RwLock behavior is
platform-dependent - on most platforms readers can starve writers
as readers will never block for a pending writer. However, on
platforms where this is not the case, one thread trying to take a
write lock may deadlock with another thread that already holds, and is
attempting to re-take, the same read lock.

Worse, our in-tree FairRwLock exhibits this behavior explicitly
on all platforms to avoid the starvation issue.

Sadly, a user ended up hitting this deadlock in production in the
form of a call to get_and_clear_pending_msg_events which holds
the ChannelManager::total_consistency_lock before calling
process_pending_monitor_events and eventually
channel_monitor_updated, which tries to take the same read lock
again.

Fixes #2000

May want to hold this one a little bit because some pending PRs may break/conflict and it's easier to fix here than there.
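
For illustration, here is a minimal, self-contained sketch of the deadlock shape described above. It is not LDK code; the thread bodies and sleeps only exist to force the problematic interleaving, and with std's RwLock it can only deadlock on platforms where a pending writer blocks new readers (LDK's FairRwLock guarantees that behavior everywhere).

use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(()));

    let reader = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let _outer = lock.read().unwrap();         // take the read lock...
            thread::sleep(Duration::from_millis(100)); // ...give the writer time to queue up
            let _inner = lock.read().unwrap();         // recursive read: may block behind the writer
        })
    };

    let writer = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            let _w = lock.write().unwrap(); // blocks until the outer read guard is dropped
        })
    };

    reader.join().unwrap();
    writer.join().unwrap();
}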

@TheBlueMatt TheBlueMatt added this to the 0.0.114 milestone Feb 3, 2023
@codecov-commenter

codecov-commenter commented Feb 3, 2023

Codecov Report

Base: 87.25% // Head: 87.22% // Decreases project coverage by -0.04% ⚠️

Coverage data is based on head (3ac12e6) compared to base (b5e5435).
Patch coverage: 92.36% of modified lines in pull request are covered.

❗ Current head 3ac12e6 differs from pull request most recent head 065dc6e. Consider uploading reports for the commit 065dc6e to get more accurate results


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2006      +/-   ##
==========================================
- Coverage   87.25%   87.22%   -0.04%     
==========================================
  Files         100      100              
  Lines       44480    44505      +25     
  Branches    44480    44505      +25     
==========================================
+ Hits        38810    38818       +8     
- Misses       5670     5687      +17     
Impacted Files Coverage Δ
lightning/src/sync/fairrwlock.rs 75.00% <ø> (ø)
lightning/src/sync/test_lockorder_checks.rs 100.00% <ø> (ø)
lightning/src/chain/channelmonitor.rs 89.58% <83.33%> (-0.05%) ⬇️
lightning/src/sync/debug_sync.rs 81.20% <83.33%> (-1.11%) ⬇️
lightning/src/ln/channelmanager.rs 86.61% <91.52%> (+0.07%) ⬆️
lightning/src/ln/payment_tests.rs 95.52% <92.85%> (-0.14%) ⬇️
lightning/src/ln/chanmon_update_fail_tests.rs 97.17% <100.00%> (-0.06%) ⬇️
lightning/src/ln/functional_test_utils.rs 88.43% <100.00%> (ø)
lightning/src/ln/functional_tests.rs 97.14% <100.00%> (-0.15%) ⬇️
lightning/src/routing/utxo.rs 89.21% <100.00%> (ø)
... and 8 more



@TheBlueMatt TheBlueMatt force-pushed the 2023-02-no-recursive-read-locks branch from 3adef72 to 46428a7 on February 4, 2023 00:51
Contributor

@dunxen dunxen left a comment

Looks good, but I'll make sure to review it after it's rebased once the conflicting things land :)

self.inner.read().map(|guard| RwLockReadGuard { lock: self, guard, first_lock }).map_err(|_| ())
// Note that while we could be taking a recursive read lock here, Rust's `RwLock` may
// deadlock trying to take a second read lock if another thread is waiting on the write
// lock. Its platform dependent (but our in-tree `FairRwLock` guarantees this behavior).
Contributor

very tiny nit: s/Its/It's/

@dunxen dunxen self-assigned this Feb 7, 2023
@wpaulino
Contributor

Needs a rebase

@TheBlueMatt TheBlueMatt marked this pull request as draft February 14, 2023 23:27
@TheBlueMatt
Collaborator Author

Converting to draft until at least #1897 because it'll need to be adapted for that, I think.

@TheBlueMatt TheBlueMatt force-pushed the 2023-02-no-recursive-read-locks branch from 46428a7 to a1bb7ca on February 22, 2023 23:01
@TheBlueMatt TheBlueMatt marked this pull request as ready for review February 22, 2023 23:01
@TheBlueMatt
Collaborator Author

Rebased, fixed some issues and added another check which turned up one more bug (see last commit).

@TheBlueMatt
Collaborator Author

Oops, forgot to update the no-std lock file.

@TheBlueMatt TheBlueMatt force-pushed the 2023-02-no-recursive-read-locks branch from 267f0a6 to ba66a76 on February 23, 2023 18:52
Contributor

@wpaulino wpaulino left a comment

CI is still failing with some lifetime errors.

@TheBlueMatt TheBlueMatt force-pushed the 2023-02-no-recursive-read-locks branch from ba66a76 to 1877714 on February 24, 2023 05:33
Contributor

@wpaulino wpaulino left a comment

LGTM after squash

@TheBlueMatt TheBlueMatt force-pushed the 2023-02-no-recursive-read-locks branch from 1877714 to 18b1e86 on February 24, 2023 20:30
@TheBlueMatt
Collaborator Author

Squashed without changes, but pushed one additional commit to export RUST_BACKTRACE=1 in CI because Windows is failing and I can't really debug it locally otherwise.

@TheBlueMatt TheBlueMatt force-pushed the 2023-02-no-recursive-read-locks branch 3 times, most recently from d459767 to a91f5e6 on February 26, 2023 20:23
@TheBlueMatt
Collaborator Author

Fixed a silent rebase conflict; see the last commit.

Contributor

@dunxen dunxen left a comment

Hmm, the one-mutex-per-line requirement can be a little surprising to contributors. I wonder if there's any easy way to enforce this/notify an author.

@dunxen
Contributor

dunxen commented Feb 27, 2023

I also see the "test transaction sync clients" job on rustc 1.59 now gives: error: package base64ct v1.6.0 cannot be built because it requires rustc 1.60 or newer, while the currently active rustc version is 1.59.0. Seems to be a sync-client dependency, though.

@TheBlueMatt
Collaborator Author

Yea, looks like lightning-transaction-sync is causing issues. Will fix at #2055

@TheBlueMatt TheBlueMatt force-pushed the 2023-02-no-recursive-read-locks branch from a91f5e6 to b8e36a2 on February 27, 2023 17:48
@TheBlueMatt
Collaborator Author

Hmm, the one-mutex-per-line requirement can be a little surprising to contributors. I wonder if there's any easy way to enforce this/notify an author.

Good point! I pushed a commit to assert this on non-Windows platforms (it will still generate false-positives on Windows, but at least on other platforms you'll get a helpful assertion failure error message).

@wpaulino
Contributor

Found one more case, and also modified the log to include the line number:

diff --git a/lightning/src/chain/channelmonitor.rs b/lightning/src/chain/channelmonitor.rs
index 44390496..64a458d2 100644
--- a/lightning/src/chain/channelmonitor.rs
+++ b/lightning/src/chain/channelmonitor.rs
@@ -4074,7 +4074,10 @@ mod tests {
 	fn test_prune_preimages() {
 		let secp_ctx = Secp256k1::new();
 		let logger = Arc::new(TestLogger::new());
-		let broadcaster = Arc::new(TestBroadcaster{txn_broadcasted: Mutex::new(Vec::new()), blocks: Arc::new(Mutex::new(Vec::new()))});
+		let broadcaster = Arc::new(TestBroadcaster{
+			txn_broadcasted: Mutex::new(Vec::new()),
+			blocks: Arc::new(Mutex::new(Vec::new()))
+		});
 		let fee_estimator = TestFeeEstimator { sat_per_kw: Mutex::new(253) };
 
 		let dummy_key = PublicKey::from_secret_key(&secp_ctx, &SecretKey::from_slice(&[42; 32]).unwrap());
diff --git a/lightning/src/sync/debug_sync.rs b/lightning/src/sync/debug_sync.rs
index 76b139f7..76a7bb41 100644
--- a/lightning/src/sync/debug_sync.rs
+++ b/lightning/src/sync/debug_sync.rs
@@ -110,11 +110,11 @@ impl LockMetadata {
 			let lock_constr_location = get_construction_location(&res._lock_construction_bt);
 			LOCKS_INIT.call_once(|| { unsafe { LOCKS = Some(StdMutex::new(HashMap::new())); } });
 			let mut locks = unsafe { LOCKS.as_ref() }.unwrap().lock().unwrap();
-			match locks.entry(lock_constr_location.0) {
+			match locks.entry(lock_constr_location.0.clone()) {
 				hash_map::Entry::Occupied(e) => {
 					assert_eq!(lock_constr_location.1,
 						get_construction_location(&e.get()._lock_construction_bt).1,
-						"Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives.");
+						"Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives: {}", lock_constr_location.0);
 					return Arc::clone(e.get())
 				},
 				hash_map::Entry::Vacant(e) => { e.insert(Arc::clone(&res)); },

@TheBlueMatt
Collaborator Author

Thanks.

@TheBlueMatt TheBlueMatt force-pushed the 2023-02-no-recursive-read-locks branch from b8e36a2 to 3ac12e6 on February 27, 2023 22:45
Contributor

@wpaulino wpaulino left a comment

Basically LGTM, feel free to squash everything on your next push

hash_map::Entry::Occupied(e) => {
assert_eq!(lock_constr_location.1,
get_construction_location(&e.get()._lock_construction_bt).1,
"Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives.");
Contributor

While we do have to clone it, I still think it's useful to include the actual line where this happens, even if you can simply go through all of your changes to find it.

Collaborator Author

The backtrace in the panic message should have it?

We previously avoided holding the `total_consistency_lock` while
doing crypto operations to build onions. However, now that we've
abstracted out the outbound payment logic into a utility module,
ensuring the state is consistent at all times is now abstracted
away from code authors and reviewers, making it likely to break.

Further, because we now call `send_payment_along_path` both with,
and without, the `total_consistency_lock`, and because recursive
read locks may deadlock, it would now be quite difficult to figure
out which paths through `outbound_payment` need the lock and which
don't.

While it may slow writes somewhat, it's not really worth trying to
figure out this mess; instead, we just hold the
`total_consistency_lock` before going into `outbound_payment`
functions.
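
For the commit above, a rough sketch of the locking convention it adopts (the names here are illustrative, not the actual ChannelManager API): the caller takes the consistency read lock once and holds it across the call into the outbound-payment code, so no path inside that module ever needs to decide whether to (re-)acquire it.

use std::sync::RwLock;

struct Manager {
    total_consistency_lock: RwLock<()>,
    // ...channel and payment state elided...
}

impl Manager {
    fn send_payment(&self) {
        // Held across the entire call; the helper below must never take this
        // lock itself, since re-taking a read lock can deadlock (see above).
        let _consistency_guard = self.total_consistency_lock.read().unwrap();
        outbound_payment_send(self);
    }
}

// Stands in for the outbound-payment module: it relies on the caller
// already holding the consistency lock.
fn outbound_payment_send(_mgr: &Manager) {
    // build onions, send along paths, record the pending payment, ...
}

fn main() {
    let mgr = Manager { total_consistency_lock: RwLock::new(()) };
    mgr.send_payment();
}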
Our existing lockorder tests assume that a read lock on a thread
that is already holding the same read lock is totally fine. This
isn't at all true. The `std` `RwLock` behavior is
platform-dependent - on most platforms readers can starve writers
as readers will never block for a pending writer. However, on
platforms where this is not the case, one thread trying to take a
write lock may deadlock with another thread that already holds, and is
attempting to re-take, the same read lock.

Worse, our in-tree `FairRwLock` exhibits this behavior explicitly
on all platforms to avoid the starvation issue.

Sadly, a user ended up hitting this deadlock in production in the
form of a call to `get_and_clear_pending_msg_events` which holds
the `ChannelManager::total_consistency_lock` before calling
`process_pending_monitor_events` and eventually
`channel_monitor_updated`, which tries to take the same read lock
again.

Luckily, the fix is trivial: simply remove the redundant read lock
in `channel_monitor_updated`.

Fixes lightningdevkit#2000
When handling a `ChannelMonitor` update via the new
`handle_new_monitor_update` macro, we always call the macro with
the `per_peer_state` read lock held and have the macro drop the
per-peer state lock. Then, when handling the resulting updates, we
may take the `per_peer_state` read lock again in another function.

In a coming commit, recursive read locks will be disallowed, so we
have to drop the `per_peer_state` read lock before calling
additional functions in `handle_new_monitor_update`, which we do
here.
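
For the commit above, the shape of that change sketched with hypothetical types: compute what is needed while the per_peer_state read guard is held, drop the guard, and only then call into code that may take the same read lock again.

use std::sync::RwLock;

struct Node {
    per_peer_state: RwLock<Vec<u32>>,
}

fn handle_monitor_update(node: &Node) {
    let peer_count = {
        let peers = node.per_peer_state.read().unwrap();
        peers.len() // do the work that needs the lock here
    }; // read guard dropped at the end of this block

    // Safe: the lock is no longer held on this thread.
    process_pending_events(node, peer_count);
}

fn process_pending_events(node: &Node, _peer_count: usize) {
    let _peers = node.per_peer_state.read().unwrap();
}

fn main() {
    let node = Node { per_peer_state: RwLock::new(vec![1, 2, 3]) };
    handle_monitor_update(&node);
}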
Our existing lockorder tests assume that a read lock on a thread
that is already holding the same read lock is totally fine. This
isn't at all true. The `std` `RwLock` behavior is
platform-dependent - on most platforms readers can starve writers
as readers will never block for a pending writer. However, on
platforms where this is not the case, one thread trying to take a
write lock may deadlock with another thread that already holds, and is
attempting to re-take, the same read lock.

Worse, our in-tree `FairRwLock` exhibits this behavior explicitly
on all platforms to avoid the starvation issue.

Thus, we shouldn't have any special handling for allowing recursive
read locks, so we simply remove it here.
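
As a rough idea of how the lockorder tests can now catch this class of bug, here is a much-simplified sketch of a read-lock wrapper that panics when the current thread already holds the same lock; the real debug_sync.rs tracks full lockorder metadata and backtraces, which are omitted here.

use std::cell::RefCell;
use std::collections::HashSet;
use std::sync::{RwLock as StdRwLock, RwLockReadGuard as StdRwLockReadGuard};

thread_local! {
    static READ_LOCKS_HELD: RefCell<HashSet<usize>> = RefCell::new(HashSet::new());
}

pub struct RwLock<T> { inner: StdRwLock<T> }

pub struct RwLockReadGuard<'a, T> {
    lock_id: usize,
    _guard: StdRwLockReadGuard<'a, T>,
}

impl<T> RwLock<T> {
    pub fn new(t: T) -> Self { Self { inner: StdRwLock::new(t) } }

    pub fn read(&self) -> RwLockReadGuard<'_, T> {
        let lock_id = self as *const Self as usize;
        READ_LOCKS_HELD.with(|held| {
            // `insert` returns false if the id is already present, i.e. this
            // thread is trying to take the same read lock recursively.
            assert!(held.borrow_mut().insert(lock_id),
                "Attempted to recursively take a read lock this thread already holds");
        });
        RwLockReadGuard { lock_id, _guard: self.inner.read().unwrap() }
    }
}

impl<T> Drop for RwLockReadGuard<'_, T> {
    fn drop(&mut self) {
        READ_LOCKS_HELD.with(|held| { held.borrow_mut().remove(&self.lock_id); });
    }
}

fn main() {
    let lock = RwLock::new(0u8);
    let _first = lock.read();
    // let _second = lock.read(); // would panic: recursive read lock
}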
Taking two instances of the same mutex may be totally fine, but it
requires a total lockorder that we cannot (trivially) check. Thus,
it's generally unsafe to do if we can avoid it.

To discourage doing this, here we default to panicking on such locks
in our lockorder tests, with a separate lock function added that is
clearly labeled "unsafe" to allow doing so when we can guarantee a
total lockorder.

This requires adapting a number of sites to the new API, including
fixing a bug this turned up in `ChannelMonitor`'s `PartialEq` where
no lockorder was guaranteed.
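
The kind of guaranteed total lockorder the new API asks callers to uphold can be illustrated with a small example: when comparing two values of the same mutex-holding type (as in ChannelMonitor's PartialEq), always locking the lower-addressed instance first gives a total order, so two concurrent comparisons of (a, b) and (b, a) cannot deadlock. The type and the specific ordering rule below are illustrative only, not necessarily what the PR's code does.

use std::sync::Mutex;

struct Monitor {
    inner: Mutex<u64>,
}

fn monitors_eq(a: &Monitor, b: &Monitor) -> bool {
    if std::ptr::eq(a, b) {
        return true; // same instance: never lock the same mutex twice
    }
    // Total lockorder: always lock the lower-addressed instance first.
    let (first, second) = if (a as *const Monitor) < (b as *const Monitor) {
        (a, b)
    } else {
        (b, a)
    };
    let g1 = first.inner.lock().unwrap();
    let g2 = second.inner.lock().unwrap();
    *g1 == *g2
}

fn main() {
    let a = Monitor { inner: Mutex::new(42) };
    let b = Monitor { inner: Mutex::new(42) };
    assert!(monitors_eq(&a, &b));
    assert!(monitors_eq(&b, &a));
}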
as this test often fails on Windows, which is hard to debug locally
for most contributors.
Our lockdep logic (on Windows) identifies a mutex based on which
line it was constructed on. Thus, if we have two mutexes
constructed on the same line, it will generate false positives.
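
The commit message above is the reason for the one-constructor-per-line rule flagged earlier in this thread. A self-contained illustration (field types simplified from the TestBroadcaster diff above): if lock identity is keyed only by construction file and line, as it must be on Windows where backtraces carry no column numbers, the two mutexes in the first initializer share a key, while the reformatted version gives each lock its own.

use std::sync::{Arc, Mutex};

struct TestBroadcaster {
    txn_broadcasted: Mutex<Vec<u8>>,
    blocks: Arc<Mutex<Vec<u8>>>,
}

fn main() {
    // Two Mutex constructions on a single line -> identical (file, line) keys,
    // so a line-based lockorder checker cannot tell the locks apart.
    let ambiguous = TestBroadcaster { txn_broadcasted: Mutex::new(Vec::new()), blocks: Arc::new(Mutex::new(Vec::new())) };

    // One construction per line -> each lock gets a distinct key.
    let distinct = TestBroadcaster {
        txn_broadcasted: Mutex::new(Vec::new()),
        blocks: Arc::new(Mutex::new(Vec::new())),
    };
    let _ = (ambiguous, distinct);
}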
@TheBlueMatt TheBlueMatt force-pushed the 2023-02-no-recursive-read-locks branch from 3ac12e6 to 065dc6e on February 28, 2023 01:06
@TheBlueMatt
Collaborator Author

Fixed the comments and squashed:

$ git diff-tree -U1 3ac12e65 065dc6e6
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 6f654645c..1045e77ad 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -830,3 +830,3 @@ where
 	/// Notifier the lock contains sends out a notification when the lock is released.
-	pub(crate) total_consistency_lock: RwLock<()>,
+	total_consistency_lock: RwLock<()>,
 
@@ -7878,6 +7878,3 @@ mod tests {
 		let session_privs = nodes[0].node.test_add_new_pending_payment(our_payment_hash, Some(payment_secret), payment_id, &mpp_route).unwrap();
-		{
-			let _lck = nodes[0].node.total_consistency_lock.read().unwrap();
-			nodes[0].node.send_payment_along_path(&mpp_route.paths[0], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[0]).unwrap();
-		}
+		nodes[0].node.test_send_payment_along_path(&mpp_route.paths[0], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[0]).unwrap();
 		check_added_monitors!(nodes[0], 1);
@@ -7911,6 +7908,3 @@ mod tests {
 		// Send the second half of the original MPP payment.
-		{
-			let _lck = nodes[0].node.total_consistency_lock.read().unwrap();
-			nodes[0].node.send_payment_along_path(&mpp_route.paths[1], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[1]).unwrap();
-		}
+		nodes[0].node.test_send_payment_along_path(&mpp_route.paths[1], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[1]).unwrap();
 		check_added_monitors!(nodes[0], 1);

Contributor

@dunxen dunxen left a comment

LGTM

@dunxen dunxen removed their assignment Feb 28, 2023
@wpaulino wpaulino merged commit 8311581 into lightningdevkit:main Feb 28, 2023
Successfully merging this pull request may close these issues.

Deadlock when using async monitor persistence