Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RGS randomly hangs after sometime #32

Closed
andrei-21 opened this issue May 8, 2023 · 10 comments · Fixed by lightningdevkit/rust-lightning#2280
Closed

RGS randomly hangs after sometime #32

andrei-21 opened this issue May 8, 2023 · 10 comments · Fixed by lightningdevkit/rust-lightning#2280

Comments

@andrei-21
Copy link
Contributor

Current Behavior:

RGS randomly hangs after sometime.

Expected Behavior:

RGS does not hang if the database is slow.

Steps To Reproduce:

  1. Build a version with 2 worker threads (or run on a machine with two cores)
    #[tokio::main(flavor = "multi_thread", worker_threads = 2)]
  2. Run on testnet with
LN_PEERS=02eadbd9e7557375161df8b646776a547c5cbc2e95b3071ec81553f8ec2cea3b8c@18.191.253.246:9735,03bae2db4b57738c1ec1ffa1c5e5a4423968cc592b3b39cddf7d495e72919d6431@18.202.91.172:9735,038863cf8ab91046230f561cd5b386cbff8309fa02e3f0c3ed161a3aeb64a643b9@203.132.94.196:9735

(more peers — higher the changes to hangs)

What Happens

The problem is with mpsc::channel(100) for GossipMessage. There is a task which receives elements in GossipPersister and tasks which send elements in GossipRouter.
GossipRouter uses try_send() method to send without blocking the thread, but if it fails (when the channel is full) it uses blocking send(), what blocks the thread. But this thread is a tokio executor thread, if all tokio executor threads get blocked in GossipRouter::new_channel_announcement() or GossipRouter::new_channel_update() then the task to receive elements from the channel will never be executed. Deadlock.

Note that there is a code in GossipRouter which tries to minimize the risk of deadlock, but unfortunate it does not eliminate it:

tokio::task::block_in_place(move || { tokio::runtime::Handle::current().block_on(async move {
				self.sender.send(gossip_message).await.unwrap();
			})});
@tnull
Copy link

tnull commented May 8, 2023

Note that there is a code in GossipRouter which tries to minimize the risk of deadlock, but unfortunate it does not eliminate it:

tokio::task::block_in_place(move || { tokio::runtime::Handle::current().block_on(async move {
				self.sender.send(gossip_message).await.unwrap();
			})});

Ah yeah, this is bound to fail, we shouldn't ever block_on anywhere. Note that the "mitigation" doesn't work as block_in_place is a no-op outside of a tokio runtime context. To quote the surprisingly brief tokio docs on the matter:

On the other hand, calling the function outside a runtime is allowed. In this case, block_in_place just calls the provided closure normally.

@TheBlueMatt
Copy link
Contributor

Wait, I'm confused on how this causes a deadlock, the mitigation in this case is called from a tokio thread (there are no non-tokio threads in the RGS server) so it should happily create a new thread for the duration of the block, and I don't immediately see any deadlock in the GossipPersister receive, but I may have missed one there.

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue May 9, 2023
If thre's a thread currently handling `PeerManager` events, the
next thread which attempts to handle events will block on the first
and then handle events after the first completes. (later threads
will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave
blocked, but if they don't (e.g. are running with a single-threaded
tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing
thread, returning immediately after signaling that the first thread
should start over once its complete to ensure all events are
handled.

While this could lead to starvation as we cause one thread to go
around and around and around again, the risk of that should be
relatively low as event handling should be pretty quick, and it's
certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32
@TheBlueMatt
Copy link
Contributor

I believe this should be fixed by lightningdevkit/rust-lightning#2280 (untested, though). I didn't manage to reproduce this issue with two threads, but with one thread I hit what looks like it should be fixed by that.

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue May 9, 2023
If thre's a thread currently handling `PeerManager` events, the
next thread which attempts to handle events will block on the first
and then handle events after the first completes. (later threads
will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave
blocked, but if they don't (e.g. are running with a single-threaded
tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing
thread, returning immediately after signaling that the first thread
should start over once its complete to ensure all events are
handled.

While this could lead to starvation as we cause one thread to go
around and around and around again, the risk of that should be
relatively low as event handling should be pretty quick, and it's
certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue May 9, 2023
If thre's a thread currently handling `PeerManager` events, the
next thread which attempts to handle events will block on the first
and then handle events after the first completes. (later threads
will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave
blocked, but if they don't (e.g. are running with a single-threaded
tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing
thread, returning immediately after signaling that the first thread
should start over once its complete to ensure all events are
handled.

While this could lead to starvation as we cause one thread to go
around and around and around again, the risk of that should be
relatively low as event handling should be pretty quick, and it's
certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32
@tnull
Copy link

tnull commented May 9, 2023

Wait, I'm confused on how this causes a deadlock, the mitigation in this case is called from a tokio thread (there are no non-tokio threads in the RGS server) so it should happily create a new thread for the duration of the block, and I don't immediately see any deadlock in the GossipPersister receive, but I may have missed one there.

Ah, hum, good point, we're indeed in a tokio context in this case.

@andrei-21
Copy link
Contributor Author

Wait, I'm confused on how this causes a deadlock, the mitigation in this case is called from a tokio thread (there are no non-tokio threads in the RGS server) so it should happily create a new thread for the duration of the block, and I don't immediately see any deadlock in the GossipPersister receive, but I may have missed one there.

Hm, you are right. My interpretation was wrong. But my fix is still applicable.

I checked the backtraces and indeed there is a thread waiting for a lock on self.event_processing_lock.
Let me try your fix if it works on the machine where I am experiencing the problem.

@andrei-21
Copy link
Contributor Author

I built a version with @TheBlueMatt fix and it solves the problem!

@TheBlueMatt
Copy link
Contributor

Cool gonna go ahead and close this since github doesn't do the magic close-on-merge thing across repos.

@TheBlueMatt
Copy link
Contributor

Oh wait, maybe it does!

@TheBlueMatt TheBlueMatt reopened this May 10, 2023
@tnull
Copy link

tnull commented May 10, 2023

And why reopened now? 😁

@TheBlueMatt
Copy link
Contributor

Cause I want to find out if GitHub magics here:)

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue May 11, 2023
If thre's a thread currently handling `PeerManager` events, the
next thread which attempts to handle events will block on the first
and then handle events after the first completes. (later threads
will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave
blocked, but if they don't (e.g. are running with a single-threaded
tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing
thread, returning immediately after signaling that the first thread
should start over once its complete to ensure all events are
handled.

While this could lead to starvation as we cause one thread to go
around and around and around again, the risk of that should be
relatively low as event handling should be pretty quick, and it's
certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32

Atomic lock simplification suggestion from @andrei-21
henghonglee pushed a commit to henghonglee/rust-lightning that referenced this issue May 24, 2023
If thre's a thread currently handling `PeerManager` events, the
next thread which attempts to handle events will block on the first
and then handle events after the first completes. (later threads
will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave
blocked, but if they don't (e.g. are running with a single-threaded
tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing
thread, returning immediately after signaling that the first thread
should start over once its complete to ensure all events are
handled.

While this could lead to starvation as we cause one thread to go
around and around and around again, the risk of that should be
relatively low as event handling should be pretty quick, and it's
certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32

Atomic lock simplification suggestion from @andrei-21
TheBlueMatt added a commit to TheBlueMatt/rapid-gossip-sync-server that referenced this issue Jul 16, 2023
This includes a fix for the deadlock in lightningdevkit#32.
TheBlueMatt added a commit to TheBlueMatt/rapid-gossip-sync-server that referenced this issue Jul 16, 2023
This includes a fix for the deadlock in lightningdevkit#32.
optout21 pushed a commit to optout21/rust-lightning that referenced this issue Jul 24, 2023
If thre's a thread currently handling `PeerManager` events, the
next thread which attempts to handle events will block on the first
and then handle events after the first completes. (later threads
will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave
blocked, but if they don't (e.g. are running with a single-threaded
tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing
thread, returning immediately after signaling that the first thread
should start over once its complete to ensure all events are
handled.

While this could lead to starvation as we cause one thread to go
around and around and around again, the risk of that should be
relatively low as event handling should be pretty quick, and it's
certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32

Atomic lock simplification suggestion from @andrei-21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants