
Never block a thread on the PeerManager event handling lock #2280

Merged
merged 1 commit into lightningdevkit:main on May 24, 2023

Conversation

TheBlueMatt
Collaborator

If there's a thread currently handling PeerManager events, the next thread which attempts to handle events will block on the first and then handle events after the first completes (later threads will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave blocked, but if they don't (e.g. are running with a single-threaded tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing thread, returning immediately after signaling that the first thread should start over once it's complete, to ensure all events are handled.

While this could lead to starvation as we cause one thread to go around and around and around again, the risk of that should be relatively low as event handling should be pretty quick, and it's certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32
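
For illustration, the blocking pattern being replaced looks roughly like the sketch below (a simplified, hypothetical shape, not the previous code verbatim; the real pre-change logic also let third and later callers return immediately). The key point is that the second caller parks while the first holds the lock, which is fatal when both callers are tasks on a single-threaded runtime.

use std::sync::Mutex;

struct EventProcessor {
    // Illustrative only: a plain mutex standing in for "someone is processing events".
    event_processing_lock: Mutex<()>,
}

impl EventProcessor {
    fn process_events_blocking(&self) {
        // A second caller blocks here until the first finishes. On a single-threaded
        // tokio runtime the task holding the lock may never be scheduled again while
        // this thread sits parked, so the whole runtime can deadlock.
        let _guard = self.event_processing_lock.lock().unwrap();
        // ... handle the pending events ...
    }
}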

@codecov-commenter

codecov-commenter commented May 9, 2023

Codecov Report

Patch coverage: 34.62% and project coverage change: +1.03% 🎉

Comparison is base (0ecb4b0) 90.94% compared to head (3a68150) 91.97%.

❗ Current head 3a68150 differs from pull request most recent head 0c034e9. Consider uploading reports for the commit 0c034e9 to get more accurate results


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2280      +/-   ##
==========================================
+ Coverage   90.94%   91.97%   +1.03%     
==========================================
  Files         104      104              
  Lines       52741    64504   +11763     
  Branches    52741    64504   +11763     
==========================================
+ Hits        47966    59330   +11364     
- Misses       4775     5174     +399     
Impacted Files Coverage Δ
lightning/src/ln/peer_handler.rs 64.67% <34.16%> (+7.24%) ⬆️
lightning/src/util/test_utils.rs 76.51% <100.00%> (+4.12%) ⬆️

... and 28 files with indirect coverage changes


@wpaulino
Contributor

wpaulino commented May 9, 2023

If there's a thread currently handling PeerManager events, the next thread which attempts to handle events will block on the first and then handle events after the first completes (later threads will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave blocked, but if they don't (e.g. are running with a single-threaded tokio runtime) this can lead to a full deadlock.

Can't this also happen now with ChannelManager::process_pending_events? We already loop until we process all events there, so there's no need to block any other threads that also call it?

@TheBlueMatt
Collaborator Author

We used to, I think, but we no longer do; a second thread trying to process events in CM will now return immediately.
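
(For intuition, "return immediately" semantics of that kind can be built with a simple boolean guard. The sketch below is illustrative only, with an assumed field name, not a quote of ChannelManager's actual code.)

use std::sync::atomic::{AtomicBool, Ordering};

struct Processor {
    // Hypothetical field name for illustration.
    processing_events: AtomicBool,
}

impl Processor {
    fn process_pending_events(&self) {
        // If another thread is already processing, return rather than block.
        if self.processing_events
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Relaxed)
            .is_err()
        {
            return;
        }
        // ... process all pending events ...
        self.processing_events.store(false, Ordering::Release);
    }
}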

wpaulino previously approved these changes on May 9, 2023
@andrei-21
Contributor

I would suggest encapsulating the locking logic in a struct and doing it a bit differently, by incrementing a counter:

use std::sync::atomic::{AtomicI32, Ordering};

pub struct CountingMutex {
    counter: AtomicI32,
}

impl CountingMutex {
    pub fn new() -> CountingMutex {
        CountingMutex {
            counter: AtomicI32::new(0),
        }
    }

    /// Returns true if the lock was acquired. A failed attempt still bumps the
    /// counter, which signals the current holder to go around again.
    pub fn try_lock(&self) -> bool {
        self.counter.fetch_add(1, Ordering::AcqRel) == 0
    }

    /// Returns true if the lock is now fully released; false means other callers
    /// signaled while we held it, so the holder should run one more pass.
    pub fn try_unlock(&self) -> bool {
        let prev = self.counter.fetch_add(-1, Ordering::AcqRel);
        debug_assert!(prev > 0, "CountingMutex is in an inconsistent state");

        if prev == 1 {
            return true;
        }
        // Collapse however many pending signals there were into a single one.
        self.counter.store(1, Ordering::Release);
        false
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        let lock = CountingMutex::new();
        // Lock, then unlock with no contention.
        assert!(lock.try_lock());
        assert!(lock.try_unlock());

        // A second try_lock signals the holder; the first try_unlock then reports
        // "go around again" and only the next one fully releases.
        assert!(lock.try_lock());
        assert!(!lock.try_lock());
        assert!(!lock.try_unlock());
        assert!(lock.try_unlock());
    }
}

@TheBlueMatt
Collaborator Author

Your alternative suggestion doesn't have the "go around again" logic that we need from this PR. We could definitely encapsulate it, though then we're adding a bunch of extra match'es and returning an enum, which would give us almost as much code to call the encapsulated struct as to actually do the atomics.

@andrei-21
Contributor

Your alternative suggestion doesn't have the "go around again" logic that we need from this PR.

Sorry, I forgot to show the usage:

	pub fn process_events(&self) {
		if !self.event_processing_state.try_lock() {
			return;
		}

		loop {
			// The body of the function.
	
			if self.event_processing_state.try_unlock() {
				return;
			}
		}
	}

I view such encapsulation not so much as reducing the amount of code, but as reducing the complexity.
Nevertheless, from a functional point of view your changes look good to me.

@TheBlueMatt
Collaborator Author

Right, but we don't have to go around more than once; we only need to go around one more time if any number of other threads signaled to us that they wanted to process events.

@andrei-21
Contributor

Right, that is why I set the counter to 1 if the value was greater.

But we do need to loop a third time if another thread reached this function during the second loop, right?

@TheBlueMatt
Collaborator Author

Ah, apologies, I'd missed the extra store(1) there; indeed, your solution is much simpler and works fine.
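
To make the one-extra-pass-per-batch behaviour concrete, here is a small single-threaded trace of the counter semantics (it assumes the CountingMutex from the earlier comment is in scope; the thread labels in the comments are narration only):

fn main() {
    let m = CountingMutex::new();
    assert!(m.try_lock());    // "thread A" starts processing (counter 0 -> 1)
    assert!(!m.try_lock());   // "thread B" arrives and returns immediately (counter 1 -> 2)
    assert!(!m.try_unlock()); // A finishes pass 1; B signaled, counter collapses to 1, go around
    assert!(!m.try_lock());   // "thread C" arrives during A's second pass (counter 1 -> 2)
    assert!(!m.try_unlock()); // A finishes pass 2; C signaled, collapse to 1, go around again
    assert!(m.try_unlock());  // A finishes pass 3 with no new signals (counter 1 -> 0), done
}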

@TheBlueMatt
Collaborator Author

Took your design, but left out the struct. This is really simple logic, and while encapsulation is generally great, it's also the case that bugs tend to creep in at the border; so for things that are simple enough to be read in-line, I prefer to avoid adding a boundary that requires context switching to read it and make sure we're using an API correctly.
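
For readers following along, the inlined version being described boils down to roughly the following shape (a sketch of the design discussed above, using an assumed field name event_processing_state; not a verbatim quote of the merged peer_handler.rs code):

use std::sync::atomic::{AtomicI32, Ordering};

struct EventProcessor {
    // Field name assumed for illustration.
    event_processing_state: AtomicI32,
}

impl EventProcessor {
    fn process_events(&self) {
        // A non-zero previous value means another thread is already processing;
        // our increment doubles as a "go around again" signal to it.
        if self.event_processing_state.fetch_add(1, Ordering::AcqRel) != 0 {
            return;
        }
        loop {
            self.handle_pending_events();
            // If nobody signaled us while we were working, we're done...
            if self.event_processing_state.fetch_add(-1, Ordering::AcqRel) == 1 {
                return;
            }
            // ...otherwise collapse however many signals arrived into one extra pass,
            // so a burst of callers can't make us spin forever.
            self.event_processing_state.store(1, Ordering::Release);
        }
    }

    fn handle_pending_events(&self) {
        // Stand-in for the actual event handling work.
    }
}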

@wpaulino
Contributor

LGTM after squash.

andrei-21 previously approved these changes on May 15, 2023
wpaulino previously approved these changes on May 15, 2023
TheBlueMatt dismissed stale reviews from wpaulino and andrei-21 via 3a68150 on May 21, 2023 17:35
@TheBlueMatt
Collaborator Author

Pushed a smoke test.

@valentinewallace
Contributor

LGTM after squash and CI fix

@TheBlueMatt
Collaborator Author

Huh, looks like the test is failing on all windows builds which is.....strange af?

If there's a thread currently handling `PeerManager` events, the
next thread which attempts to handle events will block on the first
and then handle events after the first completes (later threads
will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave
blocked, but if they don't (e.g. are running with a single-threaded
tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing
thread, returning immediately after signaling that the first thread
should start over once it's complete, to ensure all events are
handled.

While this could lead to starvation as we cause one thread to go
around and around and around again, the risk of that should be
relatively low as event handling should be pretty quick, and it's
certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32

Atomic lock simplification suggestion from @andrei-21
@TheBlueMatt
Collaborator Author

Huh! TIL windowz doesn't bother interrupting a thread if you only have one CPU core and the other threads waiting to run are in the same process...what a terrible OS. Anyway, squashed with a trivial fix for Winblowz and the above spelling issue:

$ git diff-tree -U1 3a6815024 0c034e9a8
diff --git a/lightning/src/ln/peer_handler.rs b/lightning/src/ln/peer_handler.rs
index 4aa024948..2fcfb713c 100644
--- a/lightning/src/ln/peer_handler.rs
+++ b/lightning/src/ln/peer_handler.rs
@@ -678,3 +678,3 @@ pub struct PeerManager<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, OM: D
 	node_id_to_descriptor: Mutex<HashMap<PublicKey, Descriptor>>,
-	/// We can only have one thread processing events at once, but if second call to
+	/// We can only have one thread processing events at once, but if a second call to
 	/// `process_events` happens while a first call is in progress, one of the two calls needs to
@@ -2918,2 +2918,3 @@ mod tests {
 			assert!(val <= 2);
+			std::thread::yield_now(); // Winblowz seemingly doesn't ever interrupt threads?!
 		}

wpaulino merged commit 64c58a5 into lightningdevkit:main on May 24, 2023
14 checks passed