ThreadManager: store Weak instead of Arc pool#5826
apfitzge wants to merge 2 commits into anza-xyz:master
Conversation
```rust
.pool
.upgrade()
.map(|pool| self.usage_queue_loader.count() > pool.max_usage_queue_count)
.unwrap_or_default()
```
AFAICT this value is used by cleaner to possibly drop or return to pool, so the return value here doesn't matter much if the pool no longer exists.
Might be good to call that out in a comment here or function header
I'm trying to think what we want to have happen here if we lose the handle to the pool... Do we want to consider it overgrown? Maybe this should be unwrap_or(true)?
It's an odd setup, and it just can't really happen afaict.
Maybe we should just panic?
The current setup: `SchedulerPool::return_scheduler` → `is_trashed` → `is_overgrown`,
so is_overgrown is never called (in non-test code at least) unless we have a SchedulerPool.
This makes me wonder...could we just pass the pool into is_trashed?
The only other use of the pool is in `return_to_pool`; can we just pass it there somehow? Ultimately it seems called by timeout listeners, which iirc are called by the cleaner loop, which has a weak reference to the pool itself and could pass it in.
That's a bit more restructuring, and I'm not 100% sure it would work - but it would certainly simplify the ownership and reference model - `ThreadManager` would just no longer have a `Pool` reference at all.
```diff
  usage_queue_loader: UsageQueueLoader,
  next_task_id: AtomicUsize,
- new_task_sender: Sender<NewTaskPayload>,
+ new_task_sender: Weak<Sender<NewTaskPayload>>,
```
This is the least invasive change to fix the circular dependency on the sender/recv connection in the scheduler.
We may want to re-structure the scheduler thread to not hold the entire `HandlerContext`.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff            @@
##           master    #5826    +/-   ##
=========================================
- Coverage    83.0%    83.0%    -0.1%
=========================================
  Files         828      828
  Lines      375510   375520     +10
=========================================
- Hits       311857   311848      -9
- Misses      63653    63672     +19
```
```rust
pub fn send_new_task(&self, task: Task) {
    self.new_task_sender
        .upgrade()
        .unwrap()
```
should we instead just drop the task if the upgrade fails?
I preserved the current behavior of panicking on failure to send; realistically we probably want to return an error and break whatever loops we're in wherever we send these.
yeah, I'm okay w/ following up on that separately
Problems

Problem 1 - Circular Reference of `SchedulerPool`

- `solScCleaner` holds a weak reference to the `SchedulerPool` and will break its loop only once all `Arc`s of `SchedulerPool` are dropped
- `Arc<SchedulerPool>` is held in two places: `BankForks` and the `ThreadManager`
- `ThreadManager` ownership chain: `PooledSchedulerInner` → `PooledScheduler` → `SchedulerPool` (circular reference)
- `ThreadManager` uses `pool` to check status and possibly return to the pool; in either case the `ThreadManager` can be modified to handle the `pool` no longer existing
Problem 2 - Circular Ownership Model of `new_task_sender`

- `new_task_sender` has been owned by `ThreadManager`
- `new_task_sender` became owned via `BankingStageHelper`
- `BankingStageHelper` is held in the scheduler thread, so the scheduler thread can never see an `Err` on channel disconnect (since it holds the sender itself!)
- When `ThreadManager` is dropped, it will attempt to join all threads, but the scheduler thread never exits because its channel never disconnects

Summary of Changes

- `ThreadManager` modified to hold a `Weak` reference to the `SchedulerPool` instead of a strong reference.
- `BankingStageHelper` modified to hold a `Weak<Sender<..>>` so that the scheduler can exit if the actual `new_task_sender` is dropped (not the handler threads, which send retryable transactions).

Fixes #5435
Fixes #4211
Testing
Ran `cargo test --package solana-local-cluster --test local_cluster -- test_snapshot_restart_tower --exact --show-output`.
Added some additional error logging for when the ReplayStage loop exits, BankForks is dropped, the unified-scheduler cleaner thread exits, etc.
At the end of the test, wait 10s after everything is dropped. If the cleaner is still logging at that point, it has stayed alive when it should not have.
Last few log lines from patched test:
Note that in the `master` logs, `pool(STONG_COUNT)` still reads 2, indicating there are still 2 remaining strong counts: one is the upgraded pool used in the cleaning loop itself; the other is the stuck `ThreadManager`.