fix(banking-stage): shutdown log spam #9417

Merged
OliverNChalk merged 1 commit into anza-xyz:master from OliverNChalk:fix/banking-shutdown-log-spam on Jan 21, 2026
@OliverNChalk

Problem

  • The scheduler controller shuts down if the upstream sender disconnects. The manager thread treats this as an "unexpected exit" and restarts the scheduler in a loop until the shutdown signal reaches the manager.

Summary of Changes

  • If we shut down the scheduler due to an upstream disconnect, we shut down the manager too, as there is no point in spawning additional schedulers.
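
A minimal std-only sketch of the idea, not the code from this PR: the scheduler reports why it exited, and the manager stops respawning once the cause is an upstream disconnect. The names `SchedulerExit`, `SIMULATED_FAULT`, `run_scheduler`, and `run_manager` are hypothetical, and the actual change signals shutdown differently (it pulls in `tokio_util::sync::CancellationToken`).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Receiver;

/// Hypothetical packet value used only to simulate a scheduler fault.
const SIMULATED_FAULT: u64 = u64::MAX;

/// Why the (hypothetical) scheduler returned.
#[derive(Debug, PartialEq, Eq)]
enum SchedulerExit {
    /// The upstream sender was dropped; restarting cannot help.
    UpstreamDisconnected,
    /// Any other exit; the manager may restart the scheduler.
    Unexpected,
}

/// Drains the channel until the sender disconnects or a fault is simulated.
fn run_scheduler(rx: &Receiver<u64>) -> SchedulerExit {
    loop {
        match rx.recv() {
            Ok(SIMULATED_FAULT) => return SchedulerExit::Unexpected,
            Ok(_packet) => {} // real work would happen here
            Err(_) => return SchedulerExit::UpstreamDisconnected,
        }
    }
}

/// Respawns the scheduler until `exit` is set or upstream disconnects.
/// Returns how many schedulers were started.
fn run_manager(rx: &Receiver<u64>, exit: &AtomicBool) -> usize {
    let mut spawned = 0;
    while !exit.load(Ordering::Relaxed) {
        spawned += 1;
        match run_scheduler(rx) {
            // Without this arm the loop would restart forever and log each time.
            SchedulerExit::UpstreamDisconnected => break,
            SchedulerExit::Unexpected => {
                eprintln!("scheduler exited unexpectedly; restarting");
            }
        }
    }
    spawned
}
```

With one normal packet, one simulated fault, and then a dropped sender, `run_manager` starts exactly two schedulers: the first is restarted after the fault, and the second ends the loop when it observes the disconnect.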

@OliverNChalk OliverNChalk marked this pull request as ready for review December 4, 2025 21:34
@codecov-commenter

codecov-commenter commented Dec 4, 2025

Codecov Report

❌ Patch coverage is 78.26087% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.6%. Comparing base (16af42c) to head (ebf62ae).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #9417     +/-   ##
=========================================
- Coverage    82.7%    82.6%   -0.2%     
=========================================
  Files         843      852      +9     
  Lines      315498   318151   +2653     
=========================================
+ Hits       261105   262935   +1830     
- Misses      54393    55216    +823     

Comment thread core/src/banking_stage.rs
tao-stones previously approved these changes Dec 5, 2025

@tao-stones tao-stones left a comment

lgtm, assume you find way to test it?

@OliverNChalk
Author

lgtm, assume you find way to test it?

Yep, tested manually on devnet by starting & stopping an rpc

},
time::Instant,
},
tokio_util::sync::CancellationToken,

is this suitable outside async contexts?

Author

@OliverNChalk OliverNChalk Dec 5, 2025

Pretty sure. It's either just an atomic operation if the runtime isn't parked, or a syscall to unpark the runtime (should be a write syscall for our single-threaded runtime).

Fairly sure this is where we get to on our sync caller side:

    pub fn wake(self) {
        // The actual wakeup call is delegated through a virtual function call
        // to the implementation which is defined by the executor.

        // Don't call `drop` -- the waker will be consumed by `wake`.
        let this = ManuallyDrop::new(self);

        // SAFETY: This is safe because `Waker::from_raw` is the only way
        // to initialize `wake` and `data` requiring the user to acknowledge
        // that the contract of `RawWaker` is upheld.
        unsafe { (this.waker.vtable.wake)(this.waker.data) };
    }

After this, the queued waker runs, which is type-erased. Most likely this is the wake method on the other side:

    pub(crate) fn wake(&self) -> io::Result<()> {
        // The epoll emulation on some illumos systems currently requires
        // the eventfd to be read before an edge-triggered read event is
        // generated.
        // See https://www.illumos.org/issues/16700.
        #[cfg(target_os = "illumos")]
        self.reset()?;

        let buf: [u8; 8] = 1u64.to_ne_bytes();
        match (&self.fd).write(&buf) {
            Ok(_) => Ok(()),
            Err(ref err) if err.kind() == io::ErrorKind::WouldBlock => {
                // Writing only blocks if the counter is going to overflow.
                // So we'll reset the counter to 0 and wake it again.
                self.reset()?;
                self.wake()
            }
            Err(err) => Err(err),
        }
    }
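
To illustrate the claim that waking needs no async context, here is a small std-only sketch that calls `Waker::wake` from a plain OS thread. `CountingWaker` and `wake_from_sync_thread` are hypothetical names; the counter stands in for whatever the executor's wake hook really does (for example the eventfd write quoted above), so this shows the calling convention only, not tokio's behaviour.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};
use std::thread;

/// Hypothetical stand-in for an executor's wake hook; it only counts calls.
struct CountingWaker {
    wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        // A real executor would unpark its runtime here.
        self.wakes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Calls `Waker::wake` from an ordinary thread (no runtime, no `.await`)
/// and returns the number of wake calls observed.
fn wake_from_sync_thread() -> usize {
    let state = Arc::new(CountingWaker {
        wakes: AtomicUsize::new(0),
    });
    // `Waker: From<Arc<W>>` for any `W: Wake + Send + Sync + 'static`.
    let waker = Waker::from(Arc::clone(&state));
    thread::spawn(move || waker.wake())
        .join()
        .expect("waker thread panicked");
    state.wakes.load(Ordering::SeqCst)
}
```

The wake call is a plain function call through a vtable, so the caller's thread does not need to be inside a runtime.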

Comment thread core/src/banking_stage.rs
@OliverNChalk OliverNChalk force-pushed the fix/banking-shutdown-log-spam branch from 35069b9 to e091a3b Compare December 5, 2025 21:28
tao-stones previously approved these changes Dec 5, 2025

@tao-stones tao-stones left a comment

Thanks for sorting it out for both internal and external paths.

@t-nelson

t-nelson commented Jan 7, 2026

this will be superseded by #9786, right?

@OliverNChalk
Author

OliverNChalk commented Jan 7, 2026

this will be superseded by #9786, right?

No, these are unrelated. Cavey's fix addresses a perf issue that results in "poh tick reached" spam. This fix addresses "worker shutdown unexpectedly" spam. Basically the workers see the exit signal before the management thread, which causes the management thread to spam warn/error logs.

EDIT: Forgot about this PR, so will refresh myself and determine if it's mergeable

@t-nelson

t-nelson commented Jan 7, 2026

wow. so much spam. the logs are just like my inbox

@apfitzge

anything holding the merge of this up?

@OliverNChalk
Author

Was waiting on a trent response then forgot. Have synced master in, re self-reviewed, and fixed a typo (recieve -> receive). Re-running CI with latest master then will request re-sign off for merge.

@t-nelson

r+ sme sign off

i think i'm fine with this fix, but its necessity hints at an architectural flaw. seems like the exit bool should be enough, but we're not using it correctly

@OliverNChalk
Author

OliverNChalk commented Jan 20, 2026

r+ sme sign off

i think i'm fine with this fix, but its necessity hints at an architectural flaw. seems like the exit bool should be enough, but we're not using it correctly

IMO the flaw is in Agave: it uses an exit bool and all threads shut down simultaneously. In a civilized binary we would have a shutdown sequence rather than a bunch of racy threads all shutting down at the same time. To this end the banking stage actually enforces a reasonable shutdown order (the manager waits for all workers to exit cleanly before it exits itself).
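
The ordering described here can be sketched with std threads alone. This is an illustration under assumptions, not the banking stage's code: `shutdown_order` and the event strings are hypothetical, and the workers simply spin on a shared exit flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns a manager that owns `num_workers` workers, signals exit, and
/// returns the order in which the threads reported their shutdown.
fn shutdown_order(num_workers: usize) -> Vec<String> {
    let exit = Arc::new(AtomicBool::new(false));
    let log = Arc::new(Mutex::new(Vec::new()));

    let manager = {
        let (exit, log) = (Arc::clone(&exit), Arc::clone(&log));
        thread::spawn(move || {
            let workers: Vec<_> = (0..num_workers)
                .map(|id| {
                    let (exit, log) = (Arc::clone(&exit), Arc::clone(&log));
                    thread::spawn(move || {
                        // Workers observe the exit flag directly.
                        while !exit.load(Ordering::Acquire) {
                            thread::yield_now();
                        }
                        log.lock().unwrap().push(format!("worker {} exited", id));
                    })
                })
                .collect();
            // The manager never exits before every worker has been joined.
            for worker in workers {
                worker.join().expect("worker panicked");
            }
            log.lock().unwrap().push("manager exited".to_string());
        })
    };

    exit.store(true, Ordering::Release);
    manager.join().expect("manager panicked");
    let events = log.lock().unwrap().clone();
    events
}
```

Workers may report in any order relative to each other, but the join loop means "manager exited" is always the last event.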


@apfitzge apfitzge left a comment

question about non-vote threads

Comment thread core/src/banking_stage/vote_worker.rs
@OliverNChalk OliverNChalk added this pull request to the merge queue Jan 21, 2026
Merged via the queue into anza-xyz:master with commit 5826527 Jan 21, 2026
47 checks passed
@OliverNChalk OliverNChalk deleted the fix/banking-shutdown-log-spam branch January 21, 2026 16:20