RGS randomly hangs after sometime #32

andrei-21 · 2023-05-08T13:15:04Z

Current Behavior:

RGS randomly hangs after sometime.

Expected Behavior:

RGS does not hang if the database is slow.

Steps To Reproduce:

Build a version with 2 worker threads (or run on a machine with two cores)
#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
Run on testnet with

LN_PEERS=02eadbd9e7557375161df8b646776a547c5cbc2e95b3071ec81553f8ec2cea3b8c@18.191.253.246:9735,03bae2db4b57738c1ec1ffa1c5e5a4423968cc592b3b39cddf7d495e72919d6431@18.202.91.172:9735,038863cf8ab91046230f561cd5b386cbff8309fa02e3f0c3ed161a3aeb64a643b9@203.132.94.196:9735

(more peers — higher the changes to hangs)

What Happens

The problem is with mpsc::channel(100) for GossipMessage. There is a task which receives elements in GossipPersister and tasks which send elements in GossipRouter.
GossipRouter uses try_send() method to send without blocking the thread, but if it fails (when the channel is full) it uses blocking send(), what blocks the thread. But this thread is a tokio executor thread, if all tokio executor threads get blocked in GossipRouter::new_channel_announcement() or GossipRouter::new_channel_update() then the task to receive elements from the channel will never be executed. Deadlock.

Note that there is a code in GossipRouter which tries to minimize the risk of deadlock, but unfortunate it does not eliminate it:

tokio::task::block_in_place(move || { tokio::runtime::Handle::current().block_on(async move {
				self.sender.send(gossip_message).await.unwrap();
			})});

The text was updated successfully, but these errors were encountered:

tnull · 2023-05-08T13:52:44Z

Note that there is a code in GossipRouter which tries to minimize the risk of deadlock, but unfortunate it does not eliminate it:
tokio::task::block_in_place(move || { tokio::runtime::Handle::current().block_on(async move {
				self.sender.send(gossip_message).await.unwrap();
			})});

Ah yeah, this is bound to fail, we shouldn't ever block_on anywhere. Note that the "mitigation" doesn't work as block_in_place is a no-op outside of a tokio runtime context. To quote the surprisingly brief tokio docs on the matter:

On the other hand, calling the function outside a runtime is allowed. In this case, block_in_place just calls the provided closure normally.

TheBlueMatt · 2023-05-08T23:11:14Z

Wait, I'm confused on how this causes a deadlock, the mitigation in this case is called from a tokio thread (there are no non-tokio threads in the RGS server) so it should happily create a new thread for the duration of the block, and I don't immediately see any deadlock in the GossipPersister receive, but I may have missed one there.

If thre's a thread currently handling `PeerManager` events, the next thread which attempts to handle events will block on the first and then handle events after the first completes. (later threads will return immediately to avoid blocking more than one thread). This works fine as long as the user has a spare thread to leave blocked, but if they don't (e.g. are running with a single-threaded tokio runtime) this can lead to a full deadlock. Instead, here, we never block waiting on another event processing thread, returning immediately after signaling that the first thread should start over once its complete to ensure all events are handled. While this could lead to starvation as we cause one thread to go around and around and around again, the risk of that should be relatively low as event handling should be pretty quick, and it's certainly better than deadlocking. Fixes lightningdevkit/rapid-gossip-sync-server#32

TheBlueMatt · 2023-05-09T00:37:19Z

I believe this should be fixed by lightningdevkit/rust-lightning#2280 (untested, though). I didn't manage to reproduce this issue with two threads, but with one thread I hit what looks like it should be fixed by that.

If thre's a thread currently handling `PeerManager` events, the next thread which attempts to handle events will block on the first and then handle events after the first completes. (later threads will return immediately to avoid blocking more than one thread). This works fine as long as the user has a spare thread to leave blocked, but if they don't (e.g. are running with a single-threaded tokio runtime) this can lead to a full deadlock. Instead, here, we never block waiting on another event processing thread, returning immediately after signaling that the first thread should start over once its complete to ensure all events are handled. While this could lead to starvation as we cause one thread to go around and around and around again, the risk of that should be relatively low as event handling should be pretty quick, and it's certainly better than deadlocking. Fixes lightningdevkit/rapid-gossip-sync-server#32

tnull · 2023-05-09T10:06:59Z

Wait, I'm confused on how this causes a deadlock, the mitigation in this case is called from a tokio thread (there are no non-tokio threads in the RGS server) so it should happily create a new thread for the duration of the block, and I don't immediately see any deadlock in the GossipPersister receive, but I may have missed one there.

Ah, hum, good point, we're indeed in a tokio context in this case.

andrei-21 · 2023-05-09T15:22:57Z

Wait, I'm confused on how this causes a deadlock, the mitigation in this case is called from a tokio thread (there are no non-tokio threads in the RGS server) so it should happily create a new thread for the duration of the block, and I don't immediately see any deadlock in the GossipPersister receive, but I may have missed one there.

Hm, you are right. My interpretation was wrong. But my fix is still applicable.

I checked the backtraces and indeed there is a thread waiting for a lock on self.event_processing_lock.
Let me try your fix if it works on the machine where I am experiencing the problem.

andrei-21 · 2023-05-10T10:21:48Z

I built a version with @TheBlueMatt fix and it solves the problem!

TheBlueMatt · 2023-05-10T15:22:18Z

Cool gonna go ahead and close this since github doesn't do the magic close-on-merge thing across repos.

TheBlueMatt · 2023-05-10T15:22:47Z

Oh wait, maybe it does!

tnull · 2023-05-10T15:32:34Z

And why reopened now? 😁

TheBlueMatt · 2023-05-10T21:30:16Z

Cause I want to find out if GitHub magics here:)

@andrei-21

If thre's a thread currently handling `PeerManager` events, the next thread which attempts to handle events will block on the first and then handle events after the first completes. (later threads will return immediately to avoid blocking more than one thread). This works fine as long as the user has a spare thread to leave blocked, but if they don't (e.g. are running with a single-threaded tokio runtime) this can lead to a full deadlock. Instead, here, we never block waiting on another event processing thread, returning immediately after signaling that the first thread should start over once its complete to ensure all events are handled. While this could lead to starvation as we cause one thread to go around and around and around again, the risk of that should be relatively low as event handling should be pretty quick, and it's certainly better than deadlocking. Fixes lightningdevkit/rapid-gossip-sync-server#32 Atomic lock simplification suggestion from @andrei-21

@andrei-21

If thre's a thread currently handling `PeerManager` events, the next thread which attempts to handle events will block on the first and then handle events after the first completes. (later threads will return immediately to avoid blocking more than one thread). This works fine as long as the user has a spare thread to leave blocked, but if they don't (e.g. are running with a single-threaded tokio runtime) this can lead to a full deadlock. Instead, here, we never block waiting on another event processing thread, returning immediately after signaling that the first thread should start over once its complete to ensure all events are handled. While this could lead to starvation as we cause one thread to go around and around and around again, the risk of that should be relatively low as event handling should be pretty quick, and it's certainly better than deadlocking. Fixes lightningdevkit/rapid-gossip-sync-server#32 Atomic lock simplification suggestion from @andrei-21

This includes a fix for the deadlock in lightningdevkit#32.

@andrei-21

If thre's a thread currently handling `PeerManager` events, the next thread which attempts to handle events will block on the first and then handle events after the first completes. (later threads will return immediately to avoid blocking more than one thread). This works fine as long as the user has a spare thread to leave blocked, but if they don't (e.g. are running with a single-threaded tokio runtime) this can lead to a full deadlock. Instead, here, we never block waiting on another event processing thread, returning immediately after signaling that the first thread should start over once its complete to ensure all events are handled. While this could lead to starvation as we cause one thread to go around and around and around again, the risk of that should be relatively low as event handling should be pretty quick, and it's certainly better than deadlocking. Fixes lightningdevkit/rapid-gossip-sync-server#32 Atomic lock simplification suggestion from @andrei-21

andrei-21 mentioned this issue May 8, 2023

Replace channel(100) with unbounded_channel() #33

Closed

TheBlueMatt mentioned this issue May 9, 2023

Never block a thread on the PeerManager event handling lock lightningdevkit/rust-lightning#2280

Merged

TheBlueMatt closed this as completed May 10, 2023

TheBlueMatt reopened this May 10, 2023

TheBlueMatt closed this as completed May 26, 2023

marctyndel mentioned this issue Jul 2, 2023

Add config for worker threads #41

Closed

TheBlueMatt added a commit to TheBlueMatt/rapid-gossip-sync-server that referenced this issue Jul 16, 2023

Upgrade to LDK 0.0.116-alpha1

e302d51

This includes a fix for the deadlock in lightningdevkit#32.

TheBlueMatt added a commit to TheBlueMatt/rapid-gossip-sync-server that referenced this issue Jul 16, 2023

Upgrade to LDK 0.0.116-alpha1

70254ea

This includes a fix for the deadlock in lightningdevkit#32.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RGS randomly hangs after sometime #32

RGS randomly hangs after sometime #32

andrei-21 commented May 8, 2023

tnull commented May 8, 2023

TheBlueMatt commented May 8, 2023

TheBlueMatt commented May 9, 2023

tnull commented May 9, 2023 •

edited

Loading

andrei-21 commented May 9, 2023

andrei-21 commented May 10, 2023

TheBlueMatt commented May 10, 2023

TheBlueMatt commented May 10, 2023

tnull commented May 10, 2023

TheBlueMatt commented May 10, 2023

RGS randomly hangs after sometime #32

RGS randomly hangs after sometime #32

Comments

andrei-21 commented May 8, 2023

Current Behavior:

Expected Behavior:

Steps To Reproduce:

What Happens

tnull commented May 8, 2023

TheBlueMatt commented May 8, 2023

TheBlueMatt commented May 9, 2023

tnull commented May 9, 2023 • edited Loading

andrei-21 commented May 9, 2023

andrei-21 commented May 10, 2023

TheBlueMatt commented May 10, 2023

TheBlueMatt commented May 10, 2023

tnull commented May 10, 2023

TheBlueMatt commented May 10, 2023

tnull commented May 9, 2023 •

edited

Loading