
Deadlock when using async monitor persistence #2000

Closed
danielgranhao opened this issue Feb 1, 2023 · 8 comments · Fixed by #2006

@danielgranhao
Contributor

danielgranhao commented Feb 1, 2023

I think my team and I stumbled upon a deadlock bug in LDK. It goes like this:

  1. We call ChainMonitor::channel_monitor_updated()
  2. ChannelManager::get_and_clear_pending_msg_events() eventually gets called, takes a read lock on total_consistency_lock and calls process_pending_monitor_events()
  3. One of the pending monitor events is MonitorEvent::Completed, so ChannelManager::channel_monitor_updated() is called, which also takes a read lock on total_consistency_lock

If, between the two read locks taken in steps 2 and 3, another concurrent task tries to get a write lock, a deadlock can occur, depending on the queuing policy of the OS. On my machine (macOS) I never experienced this, but on Linux machines we get random hangs. It's likely the BackgroundProcessor calling persist_manager, which takes a write lock on total_consistency_lock inside ChannelManager's write() method.
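For illustration, here is a minimal, self-contained sketch (not LDK code; all names are made up) of the underlying std RwLock behaviour: one thread re-takes a read lock it already holds while another thread is queued for the write lock. On writer-preferring platforms the program below can hang, which is the failure mode described above.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(0u32));

    // Thread (a): takes a read lock, then later tries to take a second one.
    let a_lock = Arc::clone(&lock);
    let a = thread::spawn(move || {
        let _first_read = a_lock.read().unwrap();
        // Give thread (b) time to queue up for the write lock.
        thread::sleep(Duration::from_millis(100));
        // On writer-preferring platforms this second read blocks behind (b),
        // while (b) blocks behind our first read: a deadlock.
        let _second_read = a_lock.read().unwrap();
    });

    // Thread (b): requests the write lock while (a) holds its first read lock.
    let b_lock = Arc::clone(&lock);
    let b = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let _write = b_lock.write().unwrap();
    });

    a.join().unwrap();
    b.join().unwrap();
}
```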

@andrei-21
Contributor

Stack traces of the threads: thread 22 deadlocks on total_consistency_lock and blocks the other threads.
gdb.txt

TheBlueMatt added this to the 0.0.114 milestone Feb 1, 2023
@TheBlueMatt
Collaborator

Ah! Damn, yea, std's RwLock behavior is platform-dependent. It looks like in this case a pending writer blocks readers, which would cause an issue. Will think on whether we should explicitly exempt same-thread readers in our own wrapper or whether we should ban duplicate reads.
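As a rough sketch of the "ban duplicate reads" option: a wrapper that panics in debug builds if the current thread already holds a read lock. This is hypothetical and simplified (a single per-thread counter instead of per-lock tracking, and only the read side shown with checks); it is not the actual LDK lock wrapper.

```rust
use std::cell::Cell;
use std::ops::Deref;
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

thread_local! {
    // How many read locks this thread currently holds. A real wrapper would
    // track this per lock instance rather than globally per thread.
    static READS_HELD: Cell<usize> = Cell::new(0);
}

pub struct CheckedRwLock<T> {
    inner: RwLock<T>,
}

pub struct CheckedReadGuard<'a, T> {
    guard: RwLockReadGuard<'a, T>,
}

impl<'a, T> Deref for CheckedReadGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T { &*self.guard }
}

impl<'a, T> Drop for CheckedReadGuard<'a, T> {
    fn drop(&mut self) {
        READS_HELD.with(|c| c.set(c.get() - 1));
    }
}

impl<T> CheckedRwLock<T> {
    pub fn new(value: T) -> Self {
        Self { inner: RwLock::new(value) }
    }

    // "Ban duplicate reads": in debug builds, taking a second read lock on
    // the same thread panics instead of potentially deadlocking in the field.
    pub fn read(&self) -> CheckedReadGuard<'_, T> {
        READS_HELD.with(|c| {
            debug_assert_eq!(c.get(), 0, "recursive read lock; may deadlock on some platforms");
            c.set(c.get() + 1);
        });
        CheckedReadGuard { guard: self.inner.read().unwrap() }
    }

    pub fn write(&self) -> RwLockWriteGuard<'_, T> {
        self.inner.write().unwrap()
    }
}
```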

@andrei-21
Contributor

Or PersistenceNotifierGuard can drop _read_guard explicitly before calling persistence_notifier.notify().
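A sketch of what that suggestion would look like, using hypothetical, simplified stand-ins for the guard and notifier types (the real PersistenceNotifierGuard and notifier carry more state than this):

```rust
use std::sync::{Condvar, Mutex, RwLock, RwLockReadGuard};

// Simplified stand-in for the persistence notifier.
struct Notifier {
    pending: Mutex<bool>,
    cv: Condvar,
}

impl Notifier {
    fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.cv.notify_all();
    }
}

// Simplified stand-in for PersistenceNotifierGuard.
struct Guard<'a> {
    read_guard: Option<RwLockReadGuard<'a, ()>>,
    notifier: &'a Notifier,
}

impl<'a> Guard<'a> {
    fn new(lock: &'a RwLock<()>, notifier: &'a Notifier) -> Self {
        Guard { read_guard: Some(lock.read().unwrap()), notifier }
    }
}

impl<'a> Drop for Guard<'a> {
    fn drop(&mut self) {
        // The suggestion: release the read lock first, so notify() never
        // runs while the consistency lock is held.
        drop(self.read_guard.take());
        self.notifier.notify();
    }
}
```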

@TheBlueMatt
Collaborator

That wouldn't address this issue. The issue here is that, on one platform, a thread which is requesting the write lock on total_consistency_lock will block additional threads from taking a read lock. Because it's not a reentrant lock: thread (a) holds the read lock, then thread (b) tries to acquire the write lock, then thread (a) tries to take a second read lock, which blocks indefinitely. (b) can't wake because (a) is holding a read lock, and (a) can't make further progress because a thread is waiting on the write lock.

@andrei-21
Contributor

Totally agree with what you said, but I mean something else.
PersistenceNotifierGuard calls persistence_notifier.notify() while holding the read lock via _read_guard. In this instance, channel_monitor_updated() constructs a PersistenceNotifierGuard (acquiring the read lock) and passes self.persistence_notifier, which also wants to acquire the read lock, so this deadlock happens.
The problem with PersistenceNotifierGuard is that it calls a function it does not know about while holding the read lock, which is a design issue in my opinion.
Using a reentrant lock would solve the deadlock, but would not solve the design issue.

@TheBlueMatt
Collaborator

PersistenceNotifierGuard is an RAII lock; this is (at least relatively) clear from the name - Guard - so it's ultimately up to the caller to avoid deadlocks, as with any RAII lock. PersistenceNotifierGuard doesn't call any methods it doesn't know anything about; notify is well-defined and will not block, so calling it while holding a lock is totally fine (and accepted practice). It's not an issue at all.

@andrei-21
Contributor

PersistenceNotifierGuard::optionally_notify() takes persist_check, which can be any function. Calling an unknown function under the lock is a dangerous idea; that is what I mean.
And this design issue manifests as a bug when get_and_clear_pending_msg_events() passes a big lambda containing process_pending_monitor_events(), which tries to acquire the read lock on total_consistency_lock.

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Feb 3, 2023
Our existing lockorder tests assume that a read lock on a thread
that is already holding the same read lock is totally fine. This
isn't at all true. The `std` `RwLock` behavior is
platform-dependent - on most platforms readers can starve writers
as readers will never block for a pending writer. However, on
platforms where this is not the case, one thread trying to take a
write lock may deadlock with another thread that both already has,
and is attempting to take again, a read lock.

Worse, our in-tree `FairRwLock` exhibits this behavior explicitly
on all platforms to avoid the starvation issue.

Sadly, a user ended up hitting this deadlock in production in the
form of a call to `get_and_clear_pending_msg_events` which holds
the `ChannelManager::total_consistency_lock` before calling
`process_pending_monitor_events` and eventually
`channel_monitor_updated`, which tries to take the same read lock
again.

Luckily, the fix is trivial: simply remove the redundant read lock
in `channel_monitor_updated`.

Fixes lightningdevkit#2000
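For concreteness, a heavily simplified sketch of the shape of that fix; the names mirror the methods discussed above, but the bodies are placeholders rather than the actual ChannelManager code:

```rust
use std::sync::RwLock;

// Hypothetical, heavily simplified stand-in for the channel manager.
struct Manager {
    total_consistency_lock: RwLock<()>,
}

impl Manager {
    fn get_and_clear_pending_msg_events(&self) {
        // The caller already holds the read lock for the whole call...
        let _read_guard = self.total_consistency_lock.read().unwrap();
        // ...and eventually reaches channel_monitor_updated() via
        // process_pending_monitor_events().
        self.channel_monitor_updated();
    }

    fn channel_monitor_updated(&self) {
        // Before the fix (deadlock-prone when a writer is queued):
        // let _guard = self.total_consistency_lock.read().unwrap();

        // After the fix: no second read lock is taken here; the caller's
        // read lock already provides the consistency guarantee.
    }
}
```
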
@TheBlueMatt
Collaborator

The design there is a bit weird, but it's not an "unknown function" by any means - the function, as used everywhere, is just an inline closure that exists so that we can take an RAII lock but decide whether to notify event handlers or not during the body, rather than before we lock. If you look at the call sites, it all looks (and acts) exactly like a simple RAII lock, even if the (internal-only) API itself could theoretically be used in some other way.
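A sketch of the pattern being described, with hypothetical, simplified names and signatures (the real optionally_notify differs): the closure runs under the read lock and only decides whether the guard should notify when it is dropped.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{RwLock, RwLockReadGuard};

// Simplified stand-in for the notifier the guard wakes on drop.
struct Notifier {
    pending: AtomicBool,
}

impl Notifier {
    fn notify(&self) {
        self.pending.store(true, Ordering::Release);
    }
}

struct NotifierGuard<'a> {
    _read_guard: RwLockReadGuard<'a, ()>,
    notifier: &'a Notifier,
    should_notify: bool,
}

impl<'a> NotifierGuard<'a> {
    // Take the RAII read lock, then let the closure decide (during the body,
    // not before the lock) whether event handlers should be notified on drop.
    fn optionally_notify<F: FnOnce() -> bool>(
        lock: &'a RwLock<()>,
        notifier: &'a Notifier,
        persist_check: F,
    ) -> Self {
        let _read_guard = lock.read().unwrap();
        let should_notify = persist_check();
        NotifierGuard { _read_guard, notifier, should_notify }
    }
}

impl<'a> Drop for NotifierGuard<'a> {
    fn drop(&mut self) {
        if self.should_notify {
            self.notifier.notify();
        }
    }
}

// At a call site this reads like a plain RAII lock with an inline closure:
// let _guard = NotifierGuard::optionally_notify(&lock, &notifier, || {
//     /* do the work; return true if handlers should be woken */
//     true
// });
```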
