fix(banking-stage): shutdown log spam #9417
Conversation
Codecov Report
❌ Patch coverage is …

@@            Coverage Diff             @@
##           master    #9417      +/-   ##
==========================================
- Coverage    82.7%    82.6%     -0.2%
==========================================
  Files         843      852        +9
  Lines      315498   318151     +2653
==========================================
+ Hits       261105   262935     +1830
- Misses      54393    55216      +823
tao-stones left a comment:
lgtm, assume you found a way to test it?
Force-pushed 4298e26 to 216272c
Yep, tested manually on devnet by starting & stopping an RPC node.
    },
    time::Instant,
},
tokio_util::sync::CancellationToken,
Is this suitable outside async contexts?

Pretty sure. It's just either an atomic operation if the runtime isn't parked, or a syscall to unpark the runtime (should be a write syscall for our single-threaded runtime).
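For context, a minimal sketch of that claim (assuming tokio and tokio-util as dependencies; names are illustrative, not this PR's code): a plain non-async thread calls cancel() on a CancellationToken that a single-threaded runtime is awaiting. The cancelling side needs no runtime of its own.

```rust
use tokio_util::sync::CancellationToken;

fn main() {
    let token = CancellationToken::new();
    let child = token.child_token();

    // Async side: a current-thread runtime parks on `cancelled()`.
    let handle = std::thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .unwrap();
        rt.block_on(async move {
            child.cancelled().await;
            println!("async side observed cancellation");
        });
    });

    // Sync side: `cancel()` is an ordinary method call. It flips shared
    // state and wakes any registered wakers; if the runtime is parked,
    // that wake is the syscall discussed above.
    token.cancel();
    handle.join().unwrap();
}
```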
Fairly sure this is where we get to on our sync caller side:
pub fn wake(self) {
// The actual wakeup call is delegated through a virtual function call
// to the implementation which is defined by the executor.
// Don't call `drop` -- the waker will be consumed by `wake`.
let this = ManuallyDrop::new(self);
// SAFETY: This is safe because `Waker::from_raw` is the only way
// to initialize `wake` and `data` requiring the user to acknowledge
// that the contract of `RawWaker` is upheld.
unsafe { (this.waker.vtable.wake)(this.waker.data) };
}

After this, the queued waker runs, which is type-erased. Most likely this is the wake method on the other side:
pub(crate) fn wake(&self) -> io::Result<()> {
// The epoll emulation on some illumos systems currently requires
// the eventfd to be read before an edge-triggered read event is
// generated.
// See https://www.illumos.org/issues/16700.
#[cfg(target_os = "illumos")]
self.reset()?;
let buf: [u8; 8] = 1u64.to_ne_bytes();
match (&self.fd).write(&buf) {
Ok(_) => Ok(()),
Err(ref err) if err.kind() == io::ErrorKind::WouldBlock => {
// Writing only blocks if the counter is going to overflow.
// So we'll reset the counter to 0 and wake it again.
self.reset()?;
self.wake()
}
Err(err) => Err(err),
}
}

Force-pushed 35069b9 to e091a3b
tao-stones left a comment:
Thanks for sorting it out for both internal and external paths.
This will be superseded by #9786, right?
No, these are unrelated. Cavey's fix addresses a perf issue that results in "poh tick reached" spam; this fix addresses "worker shutdown unexpectedly" spam. Basically, the workers see the exit signal before the management thread does, which causes the management thread to spam warn/error logs. EDIT: I had forgotten about this PR, so I will refresh myself and determine if it's mergeable.
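For illustration, a minimal sketch of the race being described, with hypothetical names (not Agave's actual code): the worker observes a shared exit flag and drops its channel; the manager reacts to the disconnect before re-checking the flag, so a requested, clean exit gets logged as an error.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    mpsc, Arc,
};
use std::{thread, time::Duration};

fn main() {
    let exit = Arc::new(AtomicBool::new(false));
    let (work_tx, work_rx) = mpsc::channel::<u64>();

    // Hypothetical worker: exits as soon as it sees the flag, dropping
    // `work_tx` and disconnecting the manager's receiver.
    let worker_exit = exit.clone();
    let worker = thread::spawn(move || {
        while !worker_exit.load(Ordering::Relaxed) {
            work_tx.send(1).ok();
            thread::sleep(Duration::from_millis(10));
        }
    });

    exit.store(true, Ordering::Relaxed);

    // Hypothetical manager loop: it reacts to the disconnect without
    // consulting `exit` first, so the clean exit is reported as an error.
    loop {
        match work_rx.recv() {
            Ok(_batch) => { /* ... process work ... */ }
            Err(_) => {
                eprintln!("worker shutdown unexpectedly"); // the spammed line
                break;
            }
        }
    }
    worker.join().unwrap();
}
```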
wow. so much spam. the logs are just like my inbox
Anything holding the merge of this up?
Force-pushed e091a3b to ebf62ae
Was waiting on a response from trent, then forgot. I have synced master in, re-self-reviewed, and fixed a typo (recieve -> receive). Re-running CI with latest master, then I will request re-sign-off for merge.
r+ SME sign-off. I think I'm fine with this fix, but its necessity hints at an architectural flaw. It seems like the exit bool should be enough, but we're not using it correctly.
IMO the flaw is in Agave: it uses an exit bool, and all threads shut down simultaneously. In a civilized binary we would have a shutdown sequence rather than a bunch of racy threads all shutting down at the same time. To this end, the banking stage actually enforces a reasonable shutdown order (the manager waits for all workers to exit cleanly before it itself exits), as sketched below.
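A minimal sketch of that ordering, again with hypothetical names and tokio-util's CancellationToken standing in for the exit signal: the manager cancels first, then joins every worker before exiting, so no worker exit is ever "unexpected".

```rust
use std::thread;
use tokio_util::sync::CancellationToken;

// Hypothetical worker: runs until its token is cancelled, then returns.
fn worker(token: CancellationToken, id: usize) {
    while !token.is_cancelled() {
        // ... process one batch of work ...
        thread::yield_now();
    }
    println!("worker {id} exited cleanly");
}

// Hypothetical manager: it owns the shutdown sequence, so a worker
// exit is always one it asked for.
fn main() {
    let token = CancellationToken::new();
    let workers: Vec<_> = (0..4)
        .map(|id| {
            let t = token.child_token();
            thread::spawn(move || worker(t, id))
        })
        .collect();

    // ... run until an external exit signal arrives ...
    token.cancel();

    // Enforced order: cancel, then join every worker, then exit.
    for w in workers {
        w.join().expect("worker panicked");
    }
    println!("manager exiting after all workers");
}
```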
Problem
On shutdown, banking stage workers can observe the exit signal before the management thread does. The management thread then misreads the early worker exits and spams "worker shutdown unexpectedly" warn/error logs.

Summary of Changes
Coordinate the shutdown order so the manager is never surprised by a worker exit: the manager waits for all workers to exit cleanly before it exits itself, using a tokio_util::sync::CancellationToken for signalling, covering both internal and external exit paths.