ThreadManager: store Weak instead of Arc pool#5826
apfitzge wants to merge 2 commits into anza-xyz:master
Conversation
```rust
.pool
.upgrade()
.map(|pool| self.usage_queue_loader.count() > pool.max_usage_queue_count)
.unwrap_or_default()
```
AFAICT this value is used by cleaner to possibly drop or return to pool, so the return value here doesn't matter much if the pool no longer exists.
Might be good to call that out in a comment here or function header
I'm trying to think what we want to have happen here if we lose the handle to the pool... Do we want to consider it overgrown? Maybe this should be unwrap_or(true)?
It's an odd setup, and it just can't really happen afaict.
Maybe we should just panic?
The current setup: `SchedulerPool::return_scheduler` → `is_trashed` → `is_overgrown`,
so is_overgrown is never called (in non-test code at least) unless we have a SchedulerPool.
This makes me wonder...could we just pass the pool into is_trashed?
The only other use of the pool is in `return_to_pool`; can we just pass it there somehow? Ultimately it seems called by timeout listeners, which iirc are called by the cleaner loop, which has a weak reference to the pool itself and could pass it in.
That's a bit more restructuring, and I'm not 100% sure it would work - but it would certainly simplify the ownership and reference model - `ThreadManager` would just no longer have a `Pool` reference at all.
```diff
  usage_queue_loader: UsageQueueLoader,
  next_task_id: AtomicUsize,
- new_task_sender: Sender<NewTaskPayload>,
+ new_task_sender: Weak<Sender<NewTaskPayload>>,
```
This is the least invasive change to fix the circular dependency on the sender/recv connection in the scheduler.
We may want to re-structure the scheduler thread to not hold the entire `HandlerContext`.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff            @@
##           master    #5826    +/-   ##
=========================================
- Coverage    83.0%    83.0%    -0.1%
=========================================
  Files         828      828
  Lines      375510   375520     +10
=========================================
- Hits       311857   311848      -9
- Misses      63653    63672     +19
```
```rust
pub fn send_new_task(&self, task: Task) {
    self.new_task_sender
        .upgrade()
        .unwrap()
```
should we instead just drop the task if the upgrade fails?
I preserved the current behavior of panicking on failure to send; realistically we probably want to return an error and break whatever loops we're in wherever we send these.
yeah, I'm okay w/ following up on that separately
Problems

Problem 1 - Circular Reference of `SchedulerPool`

- `solScCleaner` holds a weak reference to the `SchedulerPool` and will break its loop only once all `Arc`s of `SchedulerPool` are dropped
- `Arc<SchedulerPool>` is held in two places: `BankForks` and the `ThreadManager`
- `ThreadManager` ownership chain: `PooledSchedulerInner` → `PooledScheduler` → `SchedulerPool` (circular reference)
- `ThreadManager` uses `pool` to check status and possibly return to the pool; in either case the `ThreadManager` can be modified to handle the `pool` no longer existing
Problem 2 - Circular Ownership Model of `new_task_sender`

- `new_task_sender` has been owned by `ThreadManager`
- `new_task_sender` became owned via `BankingStageHelper`
- `BankingStageHelper` is held in the scheduler thread, so the scheduler thread can never see an `Err` on channel disconnect (since it holds the sender itself!)
- When `ThreadManager` is dropped, it will attempt to join all threads, but the scheduler thread never exits because its channel never disconnects

Summary of Changes

- `ThreadManager` modified to hold a `Weak` reference to the `SchedulerPool` instead of a strong reference.
- `BankingStageHelper` modified to hold a `Weak<Sender<..>>` so that the scheduler can exit if the actual `new_task_sender` is dropped (not the handler threads, which send retryable transactions).

Fixes #5435
Fixes #4211
Testing
Ran `cargo test --package solana-local-cluster --test local_cluster -- test_snapshot_restart_tower --exact --show-output`.
Added some additional error logging for when the ReplayStage loop exits, BankForks is dropped, the unified-scheduler cleaner thread exits, etc.
At the end of the test, wait 10s after everything is dropped. If the cleaner is still logging at that point, it has stayed alive when it should not have.
Last few log lines from patched test:
Note that in the `master` logs, `pool(STONG_COUNT)` still reads 2, indicating there are still 2 remaining strong counts: one is the upgraded pool used in the cleaning loop itself; the other is the stuck `ThreadManager`.