Fix race: Enable the bank-drop-callback *after* setting the callback#23970
brooksprumo wants to merge 1 commit into solana-labs:master
| bank_forks | ||
| .write() |
Note that I'm grabbing a write lock on the bank-forks now.
I'm assuming since this is still during startup that there will not be any contention. I wanted to prevent any possibility of another thread holding a read lock and then getting a bank before the callbacks are set. This shouldn't be an issue though, since the ordering of the is-bank-drop-callback-enabled flag has been fixed. Wdyt? Seem ok?
| /// | ||
| /// This fn shall be called only once, and before Replay starts, IFF AccountsBackgroundService | ||
| /// shall be responsible for calling AccountsDb::purge_slot() to clean up dropped banks. | ||
| pub fn set_bank_drop_callback(&mut self, bank_drop_callback: SendDroppedBankCallback) { |
Here's the impl for setting the bank drop callbacks. Note the &mut self that I mentioned in the previous comment. This fn doesn't need to be on BankForks, but I found that the fn is the most comfortable and obvious here.
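To illustrate why the &mut self receiver forces callers to take the write lock, here is a minimal sketch. The Bank and BankForks types below are hypothetical stand-ins, not the real runtime types, and the callback is reduced to a boolean; only the locking shape is meant to carry over.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins for the real types; the callback is modeled as a flag.
type Slot = u64;

#[derive(Default)]
struct Bank {
    has_drop_callback: AtomicBool,
}

#[derive(Default)]
struct BankForks {
    banks: HashMap<Slot, Arc<Bank>>,
}

impl BankForks {
    // `&mut self` means a caller holding `RwLock<BankForks>` cannot reach this
    // method through `.read()`; it has to go through `.write()`.
    fn set_bank_drop_callback(&mut self) {
        for bank in self.banks.values() {
            bank.has_drop_callback.store(true, Ordering::SeqCst);
        }
    }
}

// Takes the write lock, installs the (modeled) callback on every bank, and
// returns how many banks now have it set.
fn install_callbacks(bank_forks: &RwLock<BankForks>) -> usize {
    let mut guard = bank_forks.write().unwrap();
    guard.set_bank_drop_callback();
    guard
        .banks
        .values()
        .filter(|bank| bank.has_drop_callback.load(Ordering::SeqCst))
        .count()
}
```

While the write guard is held, no other thread can obtain a read guard and pull a bank out of the map, which is the property the earlier comment about the write lock is after.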
| /// Set the bank-drop-callback flag to ENABLED | ||
| pub(crate) fn enable_bank_drop_callback(&self) { | ||
| self.is_bank_drop_callback_enabled | ||
| .store(true, Ordering::SeqCst); |
Also, I don't see why this atomic needs Sequential Consistency for its ordering. From what I can tell, this .store() could use Release, and the .load() below could use Acquire.
I wanted to keep this PR as simple as possible, so exploring different orderings is left for a subsequent PR.
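For reference, a Release store paired with an Acquire load looks roughly like the sketch below. DropCallbackFlag and publish_then_observe are hypothetical names used only for illustration; whether this weaker ordering is actually sufficient for the real flag depends on how it interacts with other state, which is the part left for a follow-up.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for the flag that lives on AccountsDb.
struct DropCallbackFlag {
    enabled: AtomicBool,
}

impl DropCallbackFlag {
    fn new() -> Self {
        Self { enabled: AtomicBool::new(false) }
    }

    // Release: every write this thread made before the store becomes visible
    // to a thread that later observes `true` through an Acquire load.
    fn enable(&self) {
        self.enabled.store(true, Ordering::Release);
    }

    // Acquire: pairs with the Release store in `enable`.
    fn is_enabled(&self) -> bool {
        self.enabled.load(Ordering::Acquire)
    }
}

// A writer thread publishes a value and then enables the flag; the calling
// thread waits for the flag and then reads the value.
fn publish_then_observe() -> u64 {
    let flag = Arc::new(DropCallbackFlag::new());
    let data = Arc::new(AtomicU64::new(0));

    let writer = {
        let (flag, data) = (Arc::clone(&flag), Arc::clone(&data));
        thread::spawn(move || {
            data.store(42, Ordering::Relaxed); // happens-before the Release store
            flag.enable();
        })
    };

    while !flag.is_enabled() {
        std::hint::spin_loop();
    }
    let seen = data.load(Ordering::Relaxed); // ordered after the Acquire load
    writer.join().unwrap();
    seen
}
```

Once is_enabled() returns true, the Release/Acquire pair guarantees the earlier write to data is visible, so the reader sees the published value without needing SeqCst.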
Codecov Report
@@ Coverage Diff @@
## master #23970 +/- ##
=========================================
- Coverage 81.8% 81.7% -0.1%
=========================================
Files 581 585 +4
Lines 158312 159293 +981
=========================================
+ Hits 129518 130230 +712
- Misses 28794 29063 +269
|
@brooksprumo thanks for digging into this, have to admit it's quite a subtle bug in a poorly designed code path. This fix probably won't resolve the issue, though, because any bank that exists in solana/runtime/src/bank_forks.rs Line 511 in 1228935 BankForks
This probably means whichever bank that's being dropped and causing the panic:
|
|
It's easy to make the panic go away if we only check the
But not sure if this is the correct thing to do. The decision hinges on whether or not running Namely it's suggested that this assert: solana/runtime/src/accounts_db.rs Lines 4277 to 4279 in 1228935 clean_accounts(), and both do a partial clean of this slot. In this case, the purge_slots() call will not completely remove the slot, and the assert will trigger
|
|
@carllin Yep, you're 100% right! The TransactionStatusService's sender is keeping these old banks alive. There are more specifics in the issue: #23976 (comment) Independently, @mvines is working on #23852, which may resolve all this by removing the special-case nature of blockstore processing during startup. I agree that this PR does not fix the underlying panic (as described in #23976). I also think this PR is still valuable on its own, as it closes a potential race condition.
There will always be the potential references exist to banks outside of
@brooksprumo let's dig into 1. first, starting with the code mentioned in my comment above
There's no race for the banks that exist in BankForks because we always have a reference to them in |
|
@brooksprumo I think it's actually unsafe to have these stray |
|
If PR #24142 gets merged, this PR will be obviated (for v1.11) and closed. |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
|
This stale pull request has been automatically closed. Thank you for your contributions. |
Problem
There is a possible race between enabling the bank-drop-callback flag and setting the callback itself.
- Bank::drop() decides whether to use the drop-callback based on whether it has a drop-callback
- If it does not, Bank::drop() will call AccountsDb::purge_slot() directly, passing false for the is_abs parameter
- AccountsDb::purge_slot() checks the is_bank_drop_callback_enabled flag, and panics if it is set and the call did not come from AccountsBackgroundService
- Tvu::new() does:
  1. Create the drop-callback (SendDroppedBankCallback::new(pruned_banks_sender))
  2. Set the AccountsDb::is_bank_drop_callback_enabled flag to true
     <-- possible bank drop here -->
  3. Set the drop-callback on the banks
So if a bank gets dropped between 2 and 3, then the bank will not have its drop-callback set but the is_bank_drop_callback_enabled flag is set, so Bank::drop() will call AccountsDb::purge_slot() directly, and that will trigger a panic since is_bank_drop_callback_enabled is true and is_abs is false.
Additional Questions
- Arc<Bank>, but which could be running when Tvu::new() is in flight?
- Tvu::new(), but that could've been running at the same time?
Summary of Changes
Set the bank drop callback for all the banks before enabling the flag.
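The effect of the reordering can be sketched with a small model. Everything below is a hypothetical simplification: the callback is a boolean, the panic is represented by a DropOutcome::WouldPanic value rather than an actual panic, and the function names are made up for illustration. It is not the real Bank, AccountsDb, or Tvu code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical, simplified model of the flag on AccountsDb.
#[derive(Default)]
struct AccountsDb {
    is_bank_drop_callback_enabled: AtomicBool,
}

#[derive(Debug, PartialEq)]
enum DropOutcome {
    SentToBackgroundService, // the bank had a callback, so the drop is forwarded
    PurgedDirectly,          // no callback and flag disabled: direct purge is fine
    WouldPanic,              // no callback but flag enabled: the case described above
}

// Hypothetical model of a bank; the callback is reduced to a boolean.
#[derive(Default)]
struct Bank {
    has_drop_callback: bool,
}

impl Bank {
    // Models the decision made when a bank is dropped.
    fn drop_outcome(&self, db: &AccountsDb) -> DropOutcome {
        if self.has_drop_callback {
            DropOutcome::SentToBackgroundService
        } else if db.is_bank_drop_callback_enabled.load(Ordering::SeqCst) {
            DropOutcome::WouldPanic
        } else {
            DropOutcome::PurgedDirectly
        }
    }
}

// Old order: the flag is enabled first, and a bank is dropped before it
// receives its callback.
fn drop_between_steps_old_order() -> DropOutcome {
    let db = AccountsDb::default();
    let bank = Bank::default();
    db.is_bank_drop_callback_enabled.store(true, Ordering::SeqCst);
    bank.drop_outcome(&db)
}

// New order: the callback is set first, and a bank is dropped before the
// flag is enabled.
fn drop_between_steps_new_order() -> DropOutcome {
    let db = AccountsDb::default();
    let bank = Bank { has_drop_callback: true };
    bank.drop_outcome(&db)
}
```

In this model, a drop that lands between the two steps hits the panic case only under the old order. It says nothing about banks that are held outside BankForks and never receive a callback, which is the separate problem raised in the conversation.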