Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Fix race: Enable the bank-drop-callback *after* setting the callback#23970

Closed
brooksprumo wants to merge 1 commit intosolana-labs:masterfrom
brooksprumo:fix-bank-drop
Closed

Fix race: Enable the bank-drop-callback *after* setting the callback#23970
brooksprumo wants to merge 1 commit intosolana-labs:masterfrom
brooksprumo:fix-bank-drop

Conversation

@brooksprumo
Copy link
Copy Markdown
Contributor

@brooksprumo brooksprumo commented Mar 28, 2022

Problem

There is a possible race between enabling the bank-drop-callback flag and setting the callback itself.

  • Bank::drop() decides to use the drop-callback or not based on if it has a drop-callback
    • If the drop-callback is not set, then Bank::drop() will call AccountsDb::purge_slot() directly, passing false for the is_abs parameter
  • AccountsDb::purge_slot() checks the is_bank_drop_callback_enabled flag, and panics if it is set and not called from AccountsBackgroundService
  • Tvu::new() does:
    1. Creates the bank-drop-callback (SendDroppedBankCallback::new(pruned_banks_sender))
    2. Sets the AccountsDb::is_bank_drop_callback_enabled flag to true
      <-- possible bank drop here -->
    3. Sets the drop-callback on all the banks

So if a bank gets dropped between 2 and 3, then the bank will not have its drop-callback set but the is_bank_drop_callback_enabled flag is set, so Bank::drop() will call AccountsDb::purge_slot() directly, and that will trigger a panic since is_bank_drop_callback_enabled is true and is_abs is false.

Additional Questions
  • Can banks get dropped between 2 and 3? There are many concurrent threads that hold an Arc<Bank>, but which could be running when Tvu::new() is in flight?
  • Before, the banks had their drop-callback set while iterating through BankForks with a read lock; is there any code that could've been messing with the banks, not in Tvu::new(), but that could've been running at the same time?

Summary of Changes

Set the bank drop callback for all the banks before enabling the flag.

Comment thread core/src/tvu.rs
Comment on lines +255 to +256
bank_forks
.write()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I'm grabbing a write lock on the bank-forks now.

I'm assuming since this is still during startup that there will not be any contention. I wanted to prevent any possibility of another thread holding a read lock and then getting a bank before the callbacks are set. This shouldn't be an issue though, since the ordering of the is-bank-drop-callback-enabled flag has been fixed. Wdyt? Seem ok?

Comment thread runtime/src/bank_forks.rs
///
/// This fn shall be called only once, and before Replay starts, IFF AccountsBackgroundService
/// shall be responsible for calling AccountsDb::purge_slot() to clean up dropped banks.
pub fn set_bank_drop_callback(&mut self, bank_drop_callback: SendDroppedBankCallback) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the impl for setting the bank drop callbacks. Note the &mut self that I mentioned in the previous comment. This fn doesn't need to be on BankForks, but I found that the fn is the most comfortable and obvious here.

@brooksprumo brooksprumo marked this pull request as ready for review March 28, 2022 15:56
@brooksprumo brooksprumo added the bug Something isn't working label Mar 28, 2022
@brooksprumo brooksprumo changed the title Enable the bank-drop-callback flag *after* setting the callback Fix race: Enable the bank-drop-callback *after* setting the callback Mar 28, 2022
/// Set the bank-drop-callback flag to ENABLED
pub(crate) fn enable_bank_drop_callback(&self) {
self.is_bank_drop_callback_enabled
.store(true, Ordering::SeqCst);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I don't see why this atomic needs Sequential Consistency for its ordering. From what I can tell, this .store() could use Release, and the .load() below could use Acquire.

I wanted to keep this PR as simple as possible, so exploring different orderings is left for a subsequent PR.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 28, 2022

Codecov Report

Merging #23970 (1228935) into master (2da4e3e) will decrease coverage by 0.0%.
The diff coverage is 74.5%.

@@            Coverage Diff            @@
##           master   #23970     +/-   ##
=========================================
- Coverage    81.8%    81.7%   -0.1%     
=========================================
  Files         581      585      +4     
  Lines      158312   159293    +981     
=========================================
+ Hits       129518   130230    +712     
- Misses      28794    29063    +269     

@carllin
Copy link
Copy Markdown
Contributor

carllin commented Mar 29, 2022

@brooksprumo thanks for digging into this, have to admit it's quite a subtle bug in a poorly designed code path.

This fix though probably wont resolve the issue, because any bank that exists in BankForks::banks():

self.banks().values().for_each(|bank| {
will not be dropped, since a reference exists in BankForks

This probably means whichever bank that's being dropped and causing the panic:

  1. Is some bank that was removed/never added to BankForks on startup from load_frozen_forks, probably because it was replayed from the ledger and older than the current root.
  2. Some other thread was keeping a reference to that bank alive, maybe because it was sent the bank from here:
    if let Err(e) = self
    .sender
    .send(TransactionStatusMessage::Batch(TransactionStatusBatch {
    bank,
    transactions,
    execution_results,
    balances,
    token_balances,
    rent_debits,
    }))
    or here:
    .send(bank.clone())
  3. That thread dropped the reference while we were booting the rest of the validator, hence causing the panic to happen

@carllin
Copy link
Copy Markdown
Contributor

carllin commented Mar 29, 2022

It's easy to make the panic go away if we only check the is_abs when both:

  1. is_bank_drop_callback_enabled (already exists)
  2. a drop callback is enabled on this bank (i.e. we explicitly set it) on startup (this would be the new check)

But not sure if this is the correct thing to do. The decision hinges on whether or not running Bank::drop() outside of ABS for these old slots while the validator is running is safe. Will have to reason through the PR #17319 where it was introduced again, specifically this comment chain where it was proposed: #17319 (comment)

Namely it's suggested that this assert:

// After handling the reclaimed entries, this slot's
// storage entries should be purged from self.storage
assert!(self.storage.get_slot_stores(remove_slot).is_none());
can trigger when racing with clean_accounts(), and both do a partial clean of this slot. In this case, the purge_slots() call will not completely remove the slot, and the assert will trigger

@brooksprumo
Copy link
Copy Markdown
Contributor Author

@carllin Yep, you're 100% right! The TransactionStatusService's sender is keeping these old banks alive. Here's more specifics in the issue: #23976 (comment)

Independently, @mvines is working on #23852, which may resolve all this by removing the special-case nature of blockstore processing during startup.

I agree that this PR does not fix the underlying panic (as described in #23976). I also think this PR is still valuable on its own, as it closes a potential race condition.

@carllin
Copy link
Copy Markdown
Contributor

carllin commented Mar 29, 2022

Independently, @mvines is working on #23852, which may resolve all this by removing the special-case nature of blockstore processing during startup.

There will always be the potential references exist to banks outside of BankForks, which is the root cause of this issue. It's actually probably good this panicked and caught that this was occurring 😃

  1. We need to figure out the potential race exists with AccountsDb::clean() if we drop these banks outside of ABS
  2. Reason about a proper fix

@brooksprumo let's dig into 1. first, starting with the code mentioned in my comment above

I agree that this PR does not fix the underlying panic (as described in #23976). I also think this PR is still valuable on its own, as it closes a potential race condition.

There's no race for the banks that exist in BankForks because we always have a reference to them in BankForks, i.e. they can't be dropped by another thread

@carllin
Copy link
Copy Markdown
Contributor

carllin commented Mar 30, 2022

@brooksprumo I think it's actually unsafe to have these stray Bank::drop()'s handled outside of AccountsBackgroundService, so I propose this change here: #23999

@brooksprumo
Copy link
Copy Markdown
Contributor Author

If PR #24142 gets merged, this PR will be obviated (for v1.11) and closed.

@stale
Copy link
Copy Markdown

stale Bot commented Apr 16, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale Bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 16, 2022
@stale
Copy link
Copy Markdown

stale Bot commented Apr 27, 2022

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale Bot closed this Apr 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

bug Something isn't working stale [bot only] Added to stale content; results in auto-close after a week.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants