
Deadlock when using async monitor persistence #2000

Closed
danielgranhao opened this issue Feb 1, 2023 · 8 comments · Fixed by #2006

@danielgranhao
Contributor

danielgranhao commented Feb 1, 2023

I think my team and I stumbled upon a deadlock bug in LDK. It goes like this:

  1. We call ChainMonitor::channel_monitor_updated()
  2. ChannelManager::get_and_clear_pending_msg_events() eventually gets called, takes a read lock on total_consistency_lock and calls process_pending_monitor_events()
  3. One of the pending monitor events is MonitorEvent::Completed, so ChannelManager::channel_monitor_updated() is called, which also takes a read lock on total_consistency_lock

If, between the two read locks taken in steps 2 and 3, another concurrent task tries to get a write lock, a deadlock can occur, depending on the queuing policy of the OS. On my machine (macOS) I never experienced this, but on Linux machines we get random hangs. It's likely the BackgroundProcessor calling persist_manager, which takes a write lock on total_consistency_lock inside ChannelManager's write() method.
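For illustration, here is a minimal, self-contained sketch (not LDK code; all names are made up) of the underlying std RwLock behaviour: one thread re-takes a read lock it already holds while another thread is queued for the write lock. On writer-preferring platforms the program below can hang, which is the failure mode described above.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(0u32));

    // Thread (a): takes a read lock, then later tries to take a second one.
    let a_lock = Arc::clone(&lock);
    let a = thread::spawn(move || {
        let _first_read = a_lock.read().unwrap();
        // Give thread (b) time to queue up for the write lock.
        thread::sleep(Duration::from_millis(100));
        // On writer-preferring platforms this second read blocks behind (b),
        // while (b) blocks behind our first read: a deadlock.
        let _second_read = a_lock.read().unwrap();
    });

    // Thread (b): requests the write lock while (a) holds its first read lock.
    let b_lock = Arc::clone(&lock);
    let b = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let _write = b_lock.write().unwrap();
    });

    a.join().unwrap();
    b.join().unwrap();
}
```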

@andrei-21
Contributor

Stack traces of the threads: thread 22 deadlocks on total_consistency_lock and blocks the other threads.
gdb.txt

TheBlueMatt added this to the 0.0.114 milestone Feb 1, 2023
@TheBlueMatt
Collaborator

Ah! Damn, yea, std's RwLock behavior is platform-dependent. It looks like in this case a pending writer blocks readers, which would cause an issue. Will think on whether we should explicitly exempt same-thread readers in our own wrapper or whether we should ban duplicate reads.
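As a rough sketch of the "ban duplicate reads" option: a wrapper that panics in debug builds if the current thread already holds a read lock. This is hypothetical and simplified (a single per-thread counter instead of per-lock tracking, and only the read side shown with checks); it is not the actual LDK lock wrapper.

```rust
use std::cell::Cell;
use std::ops::Deref;
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

thread_local! {
    // How many read locks this thread currently holds. A real wrapper would
    // track this per lock instance rather than globally per thread.
    static READS_HELD: Cell<usize> = Cell::new(0);
}

pub struct CheckedRwLock<T> {
    inner: RwLock<T>,
}

pub struct CheckedReadGuard<'a, T> {
    guard: RwLockReadGuard<'a, T>,
}

impl<'a, T> Deref for CheckedReadGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T { &*self.guard }
}

impl<'a, T> Drop for CheckedReadGuard<'a, T> {
    fn drop(&mut self) {
        READS_HELD.with(|c| c.set(c.get() - 1));
    }
}

impl<T> CheckedRwLock<T> {
    pub fn new(value: T) -> Self {
        Self { inner: RwLock::new(value) }
    }

    // "Ban duplicate reads": in debug builds, taking a second read lock on
    // the same thread panics instead of potentially deadlocking in the field.
    pub fn read(&self) -> CheckedReadGuard<'_, T> {
        READS_HELD.with(|c| {
            debug_assert_eq!(c.get(), 0, "recursive read lock; may deadlock on some platforms");
            c.set(c.get() + 1);
        });
        CheckedReadGuard { guard: self.inner.read().unwrap() }
    }

    pub fn write(&self) -> RwLockWriteGuard<'_, T> {
        self.inner.write().unwrap()
    }
}
```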

@andrei-21
Contributor

Or PersistenceNotifierGuard can drop _read_guard explicitly before calling persistence_notifier.notify().
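A sketch of what that suggestion would look like, using hypothetical, simplified stand-ins for the guard and notifier types (the real PersistenceNotifierGuard and notifier carry more state than this):

```rust
use std::sync::{Condvar, Mutex, RwLock, RwLockReadGuard};

// Simplified stand-in for the persistence notifier.
struct Notifier {
    pending: Mutex<bool>,
    cv: Condvar,
}

impl Notifier {
    fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.cv.notify_all();
    }
}

// Simplified stand-in for PersistenceNotifierGuard.
struct Guard<'a> {
    read_guard: Option<RwLockReadGuard<'a, ()>>,
    notifier: &'a Notifier,
}

impl<'a> Guard<'a> {
    fn new(lock: &'a RwLock<()>, notifier: &'a Notifier) -> Self {
        Guard { read_guard: Some(lock.read().unwrap()), notifier }
    }
}

impl<'a> Drop for Guard<'a> {
    fn drop(&mut self) {
        // The suggestion: release the read lock first, so notify() never
        // runs while the consistency lock is held.
        drop(self.read_guard.take());
        self.notifier.notify();
    }
}
```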

@TheBlueMatt
Collaborator

That wouldn't address this issue. The issue here is that, on one platform, a thread which is requesting the write lock on total_consistency_lock will block additional threads from taking a read lock. Because it's not a reentrant lock: thread (a) holds the read lock, then thread (b) tries to acquire the write lock, then thread (a) tries to take a second read lock, which blocks indefinitely. (b) can't wake because (a) is holding a read lock, and (a) can't make further progress because a thread is waiting on the write lock.

@andrei-21
Contributor

Totally agree with what you said, but I mean something else.
PersistenceNotifierGuard calls persistence_notifier.notify() while holding the read lock via _read_guard. In this instance, channel_monitor_updated() constructs a PersistenceNotifierGuard (acquiring the read lock) and passes self.persistence_notifier, which also wants to acquire the read lock, so this deadlock happens.
The problem with PersistenceNotifierGuard is that it calls a function it does not know about while holding the read lock, which is a design issue in my opinion.
Using a reentrant lock would solve the deadlock, but would not solve the design issue.

@TheBlueMatt
Collaborator

PersistenceNotifierGuard is an RAII lock; this is (at least relatively) clear from the name - Guard - so it's ultimately up to the caller to avoid deadlocks, as with any RAII lock. PersistenceNotifierGuard doesn't call any methods it doesn't know anything about; notify is well-defined and will not block, so calling it while holding a lock is totally fine (and accepted practice). It's not an issue at all.

@andrei-21
Contributor

PersistenceNotifierGuard::optionally_notify() takes persist_check, which can be any function. Calling an unknown function under the lock is a dangerous idea; that is what I mean.
And this design issue manifests as a bug when get_and_clear_pending_msg_events() passes a big lambda containing process_pending_monitor_events(), which tries to acquire the read lock on total_consistency_lock.

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Feb 3, 2023
Our existing lockorder tests assume that a read lock on a thread
that is already holding the same read lock is totally fine. This
isn't at all true. The `std` `RwLock` behavior is
platform-dependent - on most platforms readers can starve writers
as readers will never block for a pending writer. However, on
platforms where this is not the case, one thread trying to take a
write lock may deadlock with another thread that both already has,
and is attempting to take again, a read lock.

Worse, our in-tree `FairRwLock` exhibits this behavior explicitly
on all platforms to avoid the starvation issue.

Sadly, a user ended up hitting this deadlock in production in the
form of a call to `get_and_clear_pending_msg_events` which holds
the `ChannelManager::total_consistency_lock` before calling
`process_pending_monitor_events` and eventually
`channel_monitor_updated`, which tries to take the same read lock
again.

Luckily, the fix is trivial: simply remove the redundant read lock
in `channel_monitor_updated`.

Fixes lightningdevkit#2000
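For concreteness, a heavily simplified sketch of the shape of that fix; the names mirror the methods discussed above, but the bodies are placeholders rather than the actual ChannelManager code:

```rust
use std::sync::RwLock;

// Hypothetical, heavily simplified stand-in for the channel manager.
struct Manager {
    total_consistency_lock: RwLock<()>,
}

impl Manager {
    fn get_and_clear_pending_msg_events(&self) {
        // The caller already holds the read lock for the whole call...
        let _read_guard = self.total_consistency_lock.read().unwrap();
        // ...and eventually reaches channel_monitor_updated() via
        // process_pending_monitor_events().
        self.channel_monitor_updated();
    }

    fn channel_monitor_updated(&self) {
        // Before the fix (deadlock-prone when a writer is queued):
        // let _guard = self.total_consistency_lock.read().unwrap();

        // After the fix: no second read lock is taken here; the caller's
        // read lock already provides the consistency guarantee.
    }
}
```
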
@TheBlueMatt
Collaborator

The design there is a bit weird, but it's not an "unknown function" by any means - the function, as used everywhere, is just an inline closure that exists so that we can take an RAII lock but decide whether to notify event handlers or not during the body, rather than before we lock. If you look at the call sites, it all looks (and acts) exactly like a simple RAII lock, even if the (internal-only) API itself could theoretically be used in some other way.
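A sketch of the pattern being described, with hypothetical, simplified names and signatures (the real optionally_notify differs): the closure runs under the read lock and only decides whether the guard should notify when it is dropped.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{RwLock, RwLockReadGuard};

// Simplified stand-in for the notifier the guard wakes on drop.
struct Notifier {
    pending: AtomicBool,
}

impl Notifier {
    fn notify(&self) {
        self.pending.store(true, Ordering::Release);
    }
}

struct NotifierGuard<'a> {
    _read_guard: RwLockReadGuard<'a, ()>,
    notifier: &'a Notifier,
    should_notify: bool,
}

impl<'a> NotifierGuard<'a> {
    // Take the RAII read lock, then let the closure decide (during the body,
    // not before the lock) whether event handlers should be notified on drop.
    fn optionally_notify<F: FnOnce() -> bool>(
        lock: &'a RwLock<()>,
        notifier: &'a Notifier,
        persist_check: F,
    ) -> Self {
        let _read_guard = lock.read().unwrap();
        let should_notify = persist_check();
        NotifierGuard { _read_guard, notifier, should_notify }
    }
}

impl<'a> Drop for NotifierGuard<'a> {
    fn drop(&mut self) {
        if self.should_notify {
            self.notifier.notify();
        }
    }
}

// At a call site this reads like a plain RAII lock with an inline closure:
// let _guard = NotifierGuard::optionally_notify(&lock, &notifier, || {
//     /* do the work; return true if handlers should be woken */
//     true
// });
```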
