Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Refactor purge_slots_from_cache_and_store() and handle_reclaims()#17319

Merged
carllin merged 7 commits intosolana-labs:masterfrom
carllin:RefactorPurge
May 24, 2021
Merged

Refactor purge_slots_from_cache_and_store() and handle_reclaims()#17319
carllin merged 7 commits intosolana-labs:masterfrom
carllin:RefactorPurge

Conversation

@carllin
Copy link
Copy Markdown
Contributor

@carllin carllin commented May 19, 2021

Problem

To support the dumping logic in #17269, remove_unrooted_slot()

pub fn remove_unrooted_slot(&self, remove_slot: Slot) {
needs to for an unrooted slot S:

  1. Remove all the accounts index entries for S
  2. Remove all the storage entries S

To this end, it would be nice if remove_unrooted_slot() and other purge logic like purge_slots():

let non_roots: Vec<&Slot> = slots
.iter()
.filter(|slot| !self.accounts_index.is_root(**slot))
.collect();
safety_checks_elapsed.stop();
self.external_purge_slots_stats
.safety_checks_elapsed
.fetch_add(safety_checks_elapsed.as_us(), Ordering::Relaxed);
self.purge_slots_from_cache_and_store(
true,
non_roots.into_iter(),
&self.external_purge_slots_stats,
);
(called on Bank::drop() for unrooted banks), could share the same core purging logic by. both calling into purge_slots_from_cache_and_store():
fn purge_slots_from_cache_and_store<'a>(
&'a self,
can_exist_in_cache: bool,
removed_slots: impl Iterator<Item = &'a Slot>,
purge_stats: &PurgeStats,
to handle all the associated purging logic.

Currently however purge_slots_from_cache_and_store(), if the slot to be purged is not in the cache, it does not purge the AccountsIndex entries, only the storage entries themselves:

// Note this only cleans up the storage entries. The accounts index cleaning
// (removing from the slot list, decrementing the account ref count), is handled in
// clean_accounts() -> purge_older_root_entries()
{

Summary of Changes

  1. In order to introduce accounts index purging into purge_slots_from_cache_and_store(slot) (see Problem 1 above), we would ideally like to reuse the purge_exact() -> handle_reclaims() (similar to in clean_accounts) flow to completely remove the slot.

    However, the current problem is handle_reclaims() itself calls into purge_slots_from_cache_and_store() to delete the storage entries, via the path handle_reclaims() -> process_dead_slots() -> purge_storage_slots() -> purge_slots_from_cache_and_store(), so we can't currently call handle_reclaims() inside purge_slots_from_cache_and_store().

    To fix this first we first note that handle_reclaims() only ever reclaims entries not in the cache. This means if we factor out the storage deletion logic from purge_slots_from_cache_and_store() into a separate purge_slot_storage_entries(), then we can break. this cycle by having both purge_slots_from_cache_and_store() and handle_reclaims() -> purge_storage_slots() call into purge_slot_storage_entries(): https://github.com/solana-labs/solana/compare/master...carllin:RefactorPurge?expand=1#diff-1090394420d51617f3233275c2b65ed706b35b53b115fe65f82c682af8134a6fR3112

  2. Now that purge_slots_from_cache_and_store() completely removes the accounts index entry for unrooted banks, we no longer need the logic in Bank::drop() to add the dirty keys: https://github.com/solana-labs/solana/compare/master...carllin:RefactorPurge?expand=1#diff-ed47b4a0198313377e091bb3957bbbc63d937805426d1b2b6de39d0a50d32a0cL5120-L5130
    @brooksprumo

Fixes #

Comment thread runtime/src/bank.rs
@carllin carllin force-pushed the RefactorPurge branch 2 times, most recently from dab089c to 398748c Compare May 19, 2021 08:02

let ancestors: Ancestors = vec![(slot, 0)].into_iter().collect();
let t_do_load =
start_load_thread(with_retry, ancestors, db, exit.clone(), pubkey, move |_| {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryoqun from what I remember, these tests were essentially testing that if we try to load on a purged note, we panic.

This was because the behavior before was we only removed the storage entry in purge_slot() and left the accounts index entry untouched. This meant we would fail to retrieve the account, retry the get from the index, see the same index entry again and panic when we got the same index entry:

if new_slot == slot && new_store_id == store_id {
// Considering that we're failed to get accessor above and further that
// the index still returned the same (slot, store_id) tuple, offset must be same
// too.
assert!(new_offset == offset);
// If the entry was missing from the cache, that means it must have been flushed,
// and the accounts index is always updated before cache flush, so store_id must
// not indicate being cached at this point.
assert!(new_store_id != CACHE_VIRTUAL_STORAGE_ID);
// If this is not a cache entry, then this was a minor fork slot
// that had its storage entries cleaned up by purge_slots() but hasn't been
// cleaned yet. That means this must be rpc access and not replay/banking at the
// very least. Note that purge shouldn't occur even for RPC as caller must hold all
// of ancestor slots..
assert!(load_hint == LoadHint::Unspecified);
// Everything being assert!()-ed, let's panic!() here as it's an error condition
// after all....
// That reasoning is based on the fact all of code-path reaching this fn
// retry_to_get_account_accessor() must outlive the Arc<Bank> (and its all
// ancestors) over this fn invocation, guaranteeing the prevention of being purged,
// first of all.
// For details, see the comment in AccountIndex::do_checked_scan_accounts(),
// which is referring back here.
panic!(
"Bad index entry detected ({}, {}, {}, {}, {:?})",
pubkey, slot, store_id, offset, load_hint
);

With the current set of changes where the purge_slot() call above actually removes the account index entry, which means the load thread here will realize the account is deleted and will not retry the account index get, returning None. This causes the load thread to panic here:

let loaded_account = db.do_load(&ancestors, &pubkey, None, load_hint).unwrap();
since the pubkey no longer exists in the index.

Essentially it is no longer possible for the panic to happen because the account index entry is removed before the storage entry is removed, so I have removed these tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carllin your reasoning sounds good at first glance. let me think on it more thoughly though. btw, another pretty mind-bending pr from you. :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this your comment is finally making sense... what a brain-teasing pr... lol

}

#[test]
#[ignore]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay! less work (ref #16638) :)

Copy link
Copy Markdown
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of changes in there that I don't fully grok yet, so I'll defer to others on approval. I've added tiny nits, mostly comments because I think they are helpful. I'll revisit this PR and add additional comments if I find anything else.

Comment thread runtime/src/accounts_db.rs Outdated
Comment thread runtime/src/accounts_db.rs Outdated
Comment thread runtime/src/accounts_db.rs Outdated
.fetch_add(handle_reclaims_elapsed.as_us(), Ordering::Relaxed);
// After handling the reclaimed entries, this slot's
// storage entries should be purged from self.storage
assert!(self.storage.get_slot_stores(*remove_slot).is_none());
Copy link
Copy Markdown
Contributor Author

@carllin carllin May 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to note is, this can actually race with clean_accounts(). If clean removes some of the the accounts index entries first, then we wont remove them here, and the handle_reclaims() call here may not remove all the storage entries (clean may have the remaining reclaims from the index necessary to mark the slot as dead).

Luckily right now AccountsBackgroundService runs all Bank::drop() serially with clean_accounts(), but this is the reason the test_store_scan_consistency_unrooted() was failing, which has been fixed in b062124

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Luckily right now AccountsBackgroundService runs all Bank::drop() serially with clean_accounts()

With the upcoming duplicate slot thing, how about introducing a runtime check to really validate this?

I think we can signal AccountsDb here at the identical timing to setup callbacks for all banks:

solana/core/src/tvu.rs

Lines 237 to 240 in 44831c1

// Before replay starts, set the callbacks in each of the banks in BankForks
for bank in bank_forks.read().unwrap().banks().values() {
bank.set_callback(Some(Box::new(SendDroppedBankCallback::new(
pruned_banks_sender.clone(),

so that all subsequent purge_slot() must be coming from the ABS thread, not directly from Bank::drop()-ed thread.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean, I think this property could tend to be broken in the future by other code juggling. :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryoqun, hmm so currently Bank::new_from_parent(), already guarantees that all child banks will have the proper callback.

I think we can signal AccountsDb here at the identical timing to setup callbacks

So something to verify from purge_slot() that the caller is ABS thread, i.e. something like checking the thread id is equal to the ABS thread?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So something to verify from purge_slot() that the caller is ABS thread, i.e. something like checking the thread id is equal to the ABS thread?

@carllin Oh, I was imaging more simpler way than thread id like this:

let callback = accounts_db.create_drop_callback(); // internally set `.drop_callback_installed` to `true`.
// Before replay starts, set the callbacks in each of the banks in BankForks 
for bank in bank_forks.read().unwrap().banks().values() { 
    bank.set_callback(callback)
}


impl AccountsDb {
   fn purge_slot(from_abs: bool) {
     if self.drop_callback_installed && !from_abs {
        panic!("bad drop callpath detected; bank.drop must be serialized")
     }
   }
}

impl Drop for Bank {
  fn drop() {
     accounts_db.purge_slot(..., false)
  }
}

impl ABS {
    accounts_db.purge_slot(..., true);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyway this is optional nicety.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2021

Codecov Report

Merging #17319 (146b369) into master (662c2aa) will decrease coverage by 0.0%.
The diff coverage is 96.9%.

@@            Coverage Diff            @@
##           master   #17319     +/-   ##
=========================================
- Coverage    82.7%    82.7%   -0.1%     
=========================================
  Files         424      424             
  Lines      118444   118478     +34     
=========================================
+ Hits        97994    98010     +16     
- Misses      20450    20468     +18     

@ryoqun ryoqun self-requested a review May 20, 2021 13:21
@ryoqun
Copy link
Copy Markdown
Contributor

ryoqun commented May 21, 2021

so, new callgraph looks like this..

image

.fetch_add(recycle_stores_write_elapsed, Ordering::Relaxed);
}

fn purge_storage_slots(&self, removed_slots: &HashSet<Slot>) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memo: purge_slot_storage_entries is the new equivalent fn called in process_dead_slots.

Comment thread runtime/src/accounts_db.rs Outdated

/// Purge the backing storage entries for the given slot, does not purge from
/// the cache!
fn purge_slot_storage_entries<'a>(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

Comment thread runtime/src/accounts_db.rs Outdated

let mut purge_removed_slots = Measure::start("reclaims::purge_removed_slots");
self.purge_storage_slots(&dead_slots);
self.purge_slot_storage_entries(dead_slots.iter(), purge_stats);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -3176,24 +3303,6 @@ impl AccountsDb {
.fetch_add(recycle_stores_write_elapsed, Ordering::Relaxed);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memo: end of purge_slots_from_cache_and_store in old code.

Comment thread runtime/src/accounts_db.rs Outdated
Comment thread runtime/src/accounts_db.rs
@ryoqun
Copy link
Copy Markdown
Contributor

ryoqun commented May 21, 2021

Problem

To support the dumping logic in #17269, remove_unrooted_slot() ....

....

btw, this pr description is finally making sense too. and indeed, it turns out very crisp and consumable explanation. your efforts are well deserved for writing words, not code for human. :)

Copy link
Copy Markdown
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this is making a lot more sense now; thanks for explaining it in our call! Again, my comments are just around naming and documenting. Everything else is looks good to me!

I also like @ryoqun's suggestion of extracting out the code into a new function like purge_slot_storage().

Comment thread runtime/src/accounts_index.rs
Comment thread runtime/src/accounts_db.rs Outdated
Comment thread runtime/src/accounts_db.rs Outdated
Comment thread runtime/src/accounts_db.rs Outdated
// From 1) and 2) we guarantee passing Some(slot), true is safe
// From 1) and 2) we guarantee passing `purge_stats` == None, which is
// equivalent to asserting there will be no dead slots, is safe.
let purge_stats = None;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: no_purge_stats.

clean_dead_slots.stop();

let mut purge_removed_slots = Measure::start("reclaims::purge_removed_slots");
self.purge_storage_slots(&dead_slots);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memo: so, this is the pivotal code change to cut the circular problem.

@ryoqun
Copy link
Copy Markdown
Contributor

ryoqun commented May 24, 2021

so, this is the updated call tree:

image

ryoqun
ryoqun previously approved these changes May 24, 2021
Copy link
Copy Markdown
Contributor

@ryoqun ryoqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits!

@mergify mergify Bot dismissed ryoqun’s stale review May 24, 2021 05:29

Pull request has been modified.

Comment thread runtime/src/bank.rs
// 2. At startup when replaying blockstore and there's no
// AccountsBackgroundService to perform cleanups yet.
self.rc.accounts.purge_slot(self.slot());
self.rc.accounts.purge_slot(self.slot(), false);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Copy Markdown
Contributor

@ryoqun ryoqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm again!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants