
Never block a thread on the PeerManager event handling lock #2280

Merged
merged 1 commit into lightningdevkit:main on May 24, 2023

Conversation

TheBlueMatt
Collaborator

If there's a thread currently handling PeerManager events, the next thread which attempts to handle events will block on the first and then handle events after the first completes (later threads will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave blocked, but if they don't (e.g. are running with a single-threaded tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing thread, returning immediately after signaling that the first thread should start over once it's complete, to ensure all events are handled.

While this could lead to starvation as we cause one thread to go around and around and around again, the risk of that should be relatively low as event handling should be pretty quick, and it's certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32
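
For illustration, the blocking pattern being replaced looks roughly like the sketch below (a simplified, hypothetical shape, not the previous code verbatim; the real pre-change logic also let third and later callers return immediately). The key point is that the second caller parks while the first holds the lock, which is fatal when both callers are tasks on a single-threaded runtime.

use std::sync::Mutex;

struct EventProcessor {
    // Illustrative only: a plain mutex standing in for "someone is processing events".
    event_processing_lock: Mutex<()>,
}

impl EventProcessor {
    fn process_events_blocking(&self) {
        // A second caller blocks here until the first finishes. On a single-threaded
        // tokio runtime the task holding the lock may never be scheduled again while
        // this thread sits parked, so the whole runtime can deadlock.
        let _guard = self.event_processing_lock.lock().unwrap();
        // ... handle the pending events ...
    }
}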

@codecov-commenter

codecov-commenter commented May 9, 2023

Codecov Report

Patch coverage: 34.62% and project coverage change: +1.03% 🎉

Comparison is base (0ecb4b0) 90.94% compared to head (3a68150) 91.97%.

❗ Current head 3a68150 differs from pull request most recent head 0c034e9. Consider uploading reports for the commit 0c034e9 to get more accurate results


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2280      +/-   ##
==========================================
+ Coverage   90.94%   91.97%   +1.03%     
==========================================
  Files         104      104              
  Lines       52741    64504   +11763     
  Branches    52741    64504   +11763     
==========================================
+ Hits        47966    59330   +11364     
- Misses       4775     5174     +399     
Impacted Files Coverage Δ
lightning/src/ln/peer_handler.rs 64.67% <34.16%> (+7.24%) ⬆️
lightning/src/util/test_utils.rs 76.51% <100.00%> (+4.12%) ⬆️

... and 28 files with indirect coverage changes


@wpaulino
Contributor

wpaulino commented May 9, 2023

If there's a thread currently handling PeerManager events, the next thread which attempts to handle events will block on the first and then handle events after the first completes (later threads will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave blocked, but if they don't (e.g. are running with a single-threaded tokio runtime) this can lead to a full deadlock.

Can't this also happen now with ChannelManager::process_pending_events? We already loop until we process all events there, so there's no need to block any other threads that also call it?

@TheBlueMatt
Collaborator Author

We used to, I think, but we no longer do; a second thread trying to process events in CM will now return immediately.
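
(For intuition, "return immediately" semantics of that kind can be built with a simple boolean guard. The sketch below is illustrative only, with an assumed field name, not a quote of ChannelManager's actual code.)

use std::sync::atomic::{AtomicBool, Ordering};

struct Processor {
    // Hypothetical field name for illustration.
    processing_events: AtomicBool,
}

impl Processor {
    fn process_pending_events(&self) {
        // If another thread is already processing, return rather than block.
        if self.processing_events
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Relaxed)
            .is_err()
        {
            return;
        }
        // ... process all pending events ...
        self.processing_events.store(false, Ordering::Release);
    }
}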

wpaulino previously approved these changes on May 9, 2023
@andrei-21
Contributor

I would suggest encapsulating the locking logic in a struct and doing it a bit differently, by incrementing a counter:

use std::sync::atomic::{AtomicI32, Ordering};

pub struct CountingMutex {
    counter: AtomicI32,
}

impl CountingMutex {
    pub fn new() -> CountingMutex {
        CountingMutex {
            counter: AtomicI32::new(0),
        }
    }

    /// Returns true if the lock was acquired. A failed attempt still bumps the
    /// counter, which signals the current holder to go around again.
    pub fn try_lock(&self) -> bool {
        self.counter.fetch_add(1, Ordering::AcqRel) == 0
    }

    /// Returns true if the lock is now fully released; false means other callers
    /// signaled while we held it, so the holder should run one more pass.
    pub fn try_unlock(&self) -> bool {
        let prev = self.counter.fetch_add(-1, Ordering::AcqRel);
        debug_assert!(prev > 0, "CountingMutex is in an inconsistent state");

        if prev == 1 {
            return true;
        }
        // Collapse however many pending signals there were into a single one.
        self.counter.store(1, Ordering::Release);
        false
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        let lock = CountingMutex::new();
        // Lock, then unlock with no contention.
        assert!(lock.try_lock());
        assert!(lock.try_unlock());

        // A second try_lock signals the holder; the first try_unlock then reports
        // "go around again" and only the next one fully releases.
        assert!(lock.try_lock());
        assert!(!lock.try_lock());
        assert!(!lock.try_unlock());
        assert!(lock.try_unlock());
    }
}

@TheBlueMatt
Collaborator Author

Your alternative suggestion doesn't have the "go around again" logic that we need from this PR. We could definitely encapsulate it, though then we're adding a bunch of extra match'es and returning an enum, which would give us almost as much code to call the encapsulated struct as to actually do the atomics.

@andrei-21
Contributor

Your alternative suggestion doesn't have the "go around again" logic that we need from this PR.

Sorry, I forgot to show the usage:

	pub fn process_events(&self) {
		if !self.event_processing_state.try_lock() {
			return;
		}

		loop {
			// The body of the function.
	
			if self.event_processing_state.try_unlock() {
				return;
			}
		}
	}

I view such encapsulation not so much as reducing the amount of code, but as reducing the complexity.
Nevertheless, from a functional point of view your changes look good to me.

@TheBlueMatt
Collaborator Author

Right, but we don't have to go around more than once; we only need to go around one more time if any number of other threads signaled to us that they wanted to process events.

@andrei-21
Contributor

Right, that is why I set the counter to 1 if the value was greater.

But we do need to loop a third time if another thread reached this function during the second loop, right?

@TheBlueMatt
Collaborator Author

Ah, apologies, I'd missed the extra store(1) there; indeed, your solution is much simpler and works fine.
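
To make the one-extra-pass-per-batch behaviour concrete, here is a small single-threaded trace of the counter semantics (it assumes the CountingMutex from the earlier comment is in scope; the thread labels in the comments are narration only):

fn main() {
    let m = CountingMutex::new();
    assert!(m.try_lock());    // "thread A" starts processing (counter 0 -> 1)
    assert!(!m.try_lock());   // "thread B" arrives and returns immediately (counter 1 -> 2)
    assert!(!m.try_unlock()); // A finishes pass 1; B signaled, counter collapses to 1, go around
    assert!(!m.try_lock());   // "thread C" arrives during A's second pass (counter 1 -> 2)
    assert!(!m.try_unlock()); // A finishes pass 2; C signaled, collapse to 1, go around again
    assert!(m.try_unlock());  // A finishes pass 3 with no new signals (counter 1 -> 0), done
}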

@TheBlueMatt
Collaborator Author

Took your design, but left out the struct. This is really simple logic, and while encapsulation is generally great, it's also the case that bugs tend to creep in at the border; so for things that are simple enough to be read in-line, I prefer to avoid adding a boundary that requires context switching to read it and make sure we're using an API correctly.
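
For readers following along, the inlined version being described boils down to roughly the following shape (a sketch of the design discussed above, using an assumed field name event_processing_state; not a verbatim quote of the merged peer_handler.rs code):

use std::sync::atomic::{AtomicI32, Ordering};

struct EventProcessor {
    // Field name assumed for illustration.
    event_processing_state: AtomicI32,
}

impl EventProcessor {
    fn process_events(&self) {
        // A non-zero previous value means another thread is already processing;
        // our increment doubles as a "go around again" signal to it.
        if self.event_processing_state.fetch_add(1, Ordering::AcqRel) != 0 {
            return;
        }
        loop {
            self.handle_pending_events();
            // If nobody signaled us while we were working, we're done...
            if self.event_processing_state.fetch_add(-1, Ordering::AcqRel) == 1 {
                return;
            }
            // ...otherwise collapse however many signals arrived into one extra pass,
            // so a burst of callers can't make us spin forever.
            self.event_processing_state.store(1, Ordering::Release);
        }
    }

    fn handle_pending_events(&self) {
        // Stand-in for the actual event handling work.
    }
}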

@wpaulino
Contributor

LGTM after squash.

andrei-21 previously approved these changes on May 15, 2023
wpaulino previously approved these changes on May 15, 2023
TheBlueMatt dismissed stale reviews from wpaulino and andrei-21 via 3a68150 on May 21, 2023 17:35
@TheBlueMatt
Collaborator Author

Pushed a smoke test.

@valentinewallace
Contributor

LGTM after squash and CI fix

@TheBlueMatt
Collaborator Author

Huh, looks like the test is failing on all windows builds which is.....strange af?

If there's a thread currently handling `PeerManager` events, the
next thread which attempts to handle events will block on the first
and then handle events after the first completes (later threads
will return immediately to avoid blocking more than one thread).

This works fine as long as the user has a spare thread to leave
blocked, but if they don't (e.g. are running with a single-threaded
tokio runtime) this can lead to a full deadlock.

Instead, here, we never block waiting on another event processing
thread, returning immediately after signaling that the first thread
should start over once it's complete, to ensure all events are
handled.

While this could lead to starvation as we cause one thread to go
around and around and around again, the risk of that should be
relatively low as event handling should be pretty quick, and it's
certainly better than deadlocking.

Fixes lightningdevkit/rapid-gossip-sync-server#32

Atomic lock simplification suggestion from @andrei-21
@TheBlueMatt
Collaborator Author

Huh! TIL windowz doesn't bother interrupting a thread if you only have one CPU core and the other threads waiting to run are in the same process...what a terrible OS. Anyway, squashed with a trivial fix for Winblowz and the above spelling issue:

$ git diff-tree -U1 3a6815024 0c034e9a8
diff --git a/lightning/src/ln/peer_handler.rs b/lightning/src/ln/peer_handler.rs
index 4aa024948..2fcfb713c 100644
--- a/lightning/src/ln/peer_handler.rs
+++ b/lightning/src/ln/peer_handler.rs
@@ -678,3 +678,3 @@ pub struct PeerManager<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, OM: D
 	node_id_to_descriptor: Mutex<HashMap<PublicKey, Descriptor>>,
-	/// We can only have one thread processing events at once, but if second call to
+	/// We can only have one thread processing events at once, but if a second call to
 	/// `process_events` happens while a first call is in progress, one of the two calls needs to
@@ -2918,2 +2918,3 @@ mod tests {
 			assert!(val <= 2);
+			std::thread::yield_now(); // Winblowz seemingly doesn't ever interrupt threads?!
 		}

wpaulino merged commit 64c58a5 into lightningdevkit:main on May 24, 2023
14 checks passed