Refactor purge_slots_from_cache_and_store() and handle_reclaims() #17319
carllin merged 7 commits into solana-labs:master
let ancestors: Ancestors = vec![(slot, 0)].into_iter().collect();
let t_do_load =
    start_load_thread(with_retry, ancestors, db, exit.clone(), pubkey, move |_| {
@ryoqun from what I remember, these tests were essentially testing that if we try to load on a purged slot, we panic.
This was because previously purge_slot() only removed the storage entry and left the accounts index entry untouched. This meant we would fail to retrieve the account, retry the get from the index, see the same index entry again, and panic:
solana/runtime/src/accounts_db.rs
Lines 2674 to 2703 in 305d9dd
With the current set of changes, the purge_slot() call above actually removes the accounts index entry, so the load thread here will realize the account is deleted and will not retry the accounts index get, returning None. This causes the load thread to panic here:
solana/runtime/src/accounts_db.rs
Line 10184 in 305d9dd
Essentially it is no longer possible for the panic to happen because the account index entry is removed before the storage entry is removed, so I have removed these tests.
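A minimal sketch of the ordering argument above, using a toy model rather than the real AccountsDb. Everything here (ToyDb, LoadResult, purge_storage_only, purge_index_then_storage) is a hypothetical name invented for illustration; the real load path and its retry logic are more involved.

```rust
use std::collections::HashMap;

type Slot = u64;
type Pubkey = &'static str;

/// Toy stand-in for AccountsDb (assumption): an index mapping each pubkey to
/// a slot, and a storage map holding the account data for each slot.
#[derive(Default)]
struct ToyDb {
    index: HashMap<Pubkey, Slot>,
    storage: HashMap<Slot, HashMap<Pubkey, u64>>,
}

#[derive(Debug, PartialEq)]
enum LoadResult {
    Found(u64),
    /// No index entry: the account is treated as deleted, nothing to retry.
    NotInIndex,
    /// Index entry present but storage gone: the case the removed tests exercised.
    StaleIndexEntry,
}

impl ToyDb {
    fn store(&mut self, slot: Slot, key: Pubkey, lamports: u64) {
        self.index.insert(key, slot);
        self.storage.entry(slot).or_default().insert(key, lamports);
    }

    /// Old behavior in this model: only the storage entry is removed.
    fn purge_storage_only(&mut self, slot: Slot) {
        self.storage.remove(&slot);
    }

    /// New behavior in this model: index entries are removed first, then storage.
    fn purge_index_then_storage(&mut self, slot: Slot) {
        self.index.retain(|_, s| *s != slot);
        self.storage.remove(&slot);
    }

    fn load(&self, key: Pubkey) -> LoadResult {
        let Some(slot) = self.index.get(&key) else {
            return LoadResult::NotInIndex;
        };
        match self.storage.get(slot).and_then(|accounts| accounts.get(&key)) {
            Some(lamports) => LoadResult::Found(*lamports),
            None => LoadResult::StaleIndexEntry,
        }
    }
}
```

In this model, a load after purge_storage_only() keeps hitting the stale index entry, while a load after purge_index_then_storage() sees no index entry at all, which corresponds to the None case described above.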
@carllin your reasoning sounds good at first glance. let me think on it more thoroughly though. btw, another pretty mind-bending pr from you. :)
this comment of yours is finally making sense... what a brain-teasing pr... lol
}

#[test]
#[ignore]
brooksprumo left a comment
There are a lot of changes in here that I don't fully grok yet, so I'll defer to others on approval. I've added tiny nits, mostly comments because I think they are helpful. I'll revisit this PR and add additional comments if I find anything else.
.fetch_add(handle_reclaims_elapsed.as_us(), Ordering::Relaxed);
// After handling the reclaimed entries, this slot's
// storage entries should be purged from self.storage
assert!(self.storage.get_slot_stores(*remove_slot).is_none());
One thing to note is that this can actually race with clean_accounts(). If clean removes some of the accounts index entries first, then we won't remove them here, and the handle_reclaims() call here may not remove all the storage entries (clean may hold the remaining reclaims from the index that are necessary to mark the slot as dead).
Luckily, right now AccountsBackgroundService runs all Bank::drop() serially with clean_accounts(), but this is the reason test_store_scan_consistency_unrooted() was failing, which has been fixed in b062124.
> Luckily right now AccountsBackgroundService runs all Bank::drop() serially with clean_accounts()
With the upcoming duplicate slot thing, how about introducing a runtime check to really validate this?
I think we can signal AccountsDb here, at the same time the callbacks are set up for all banks:
Lines 237 to 240 in 44831c1
so that all subsequent purge_slot() calls must come from the ABS thread, not directly from the thread that dropped the Bank.
I mean, I think this property could easily be broken in the future by other code juggling. :)
@ryoqun, hmm, so currently Bank::new_from_parent() already guarantees that all child banks will have the proper callback.
> I think we can signal AccountsDb here at the identical timing to setup callbacks
So something to verify from purge_slot() that the caller is the ABS thread, i.e. something like checking that the thread id is equal to the ABS thread's?
> So something to verify from purge_slot() that the caller is ABS thread, i.e. something like checking the thread id is equal to the ABS thread?
@carllin Oh, I was imagining a simpler way than the thread id, like this:
    // Internally sets `.drop_callback_installed` to `true`.
    let callback = accounts_db.create_drop_callback();
    // Before replay starts, set the callbacks in each of the banks in BankForks
    for bank in bank_forks.read().unwrap().banks().values() {
        bank.set_callback(callback)
    }

    impl AccountsDb {
        fn purge_slot(&self, ..., from_abs: bool) {
            if self.drop_callback_installed && !from_abs {
                panic!("bad drop callpath detected; bank.drop must be serialized")
            }
        }
    }

    impl Drop for Bank {
        fn drop(&mut self) {
            // Direct drop path: not coming from ABS.
            accounts_db.purge_slot(..., false)
        }
    }

    impl ABS {
        // Serialized path: ABS handles the dropped bank.
        accounts_db.purge_slot(..., true);
    }
Anyway, this is an optional nicety.
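For reference, the proposed check can be sketched as a small self-contained model. This is only an illustration under assumptions: ToyAccountsDb, install_drop_callback() and check_purge_caller() are hypothetical names, not existing AccountsDb APIs, and the check returns an error instead of panicking so that it is easy to exercise.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for AccountsDb carrying only the proposed flag.
#[derive(Default)]
struct ToyAccountsDb {
    drop_callback_installed: AtomicBool,
}

impl ToyAccountsDb {
    /// Models the effect of creating the drop callback: from this point on,
    /// every purge is expected to arrive through the ABS path.
    fn install_drop_callback(&self) {
        self.drop_callback_installed.store(true, Ordering::SeqCst);
    }

    /// Rejects a purge that bypasses ABS once the callback is installed.
    /// The real proposal would panic here instead of returning Err.
    fn check_purge_caller(&self, from_abs: bool) -> Result<(), &'static str> {
        if self.drop_callback_installed.load(Ordering::SeqCst) && !from_abs {
            return Err("bad drop callpath detected; bank.drop must be serialized");
        }
        Ok(())
    }
}
```

Before the callback is installed (e.g. at startup), a direct purge from Bank::drop() passes the check; afterwards only callers that pass from_abs = true do.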
.fetch_add(recycle_stores_write_elapsed, Ordering::Relaxed);
}

fn purge_storage_slots(&self, removed_slots: &HashSet<Slot>) {
memo: purge_slot_storage_entries is the new equivalent fn called in process_dead_slots.
/// Purge the backing storage entries for the given slot, does not purge from
/// the cache!
fn purge_slot_storage_entries<'a>(
  let mut purge_removed_slots = Measure::start("reclaims::purge_removed_slots");
- self.purge_storage_slots(&dead_slots);
+ self.purge_slot_storage_entries(dead_slots.iter(), purge_stats);
@@ -3176,24 +3303,6 @@ impl AccountsDb {
.fetch_add(recycle_stores_write_elapsed, Ordering::Relaxed);
}
memo: end of purge_slots_from_cache_and_store in old code.
btw, this PR description is finally making sense too. and indeed, it turns out to be a very crisp and consumable explanation. your effort in writing words for humans, not just code, is well deserved. :)
- // From 1) and 2) we guarantee passing Some(slot), true is safe
+ // From 1) and 2) we guarantee passing `purge_stats` == None, which is
+ // equivalent to asserting there will be no dead slots, is safe.
+ let purge_stats = None;
  clean_dead_slots.stop();

  let mut purge_removed_slots = Measure::start("reclaims::purge_removed_slots");
- self.purge_storage_slots(&dead_slots);
memo: so, this is the pivotal code change to cut the circular problem.
  // 2. At startup when replaying blockstore and there's no
  // AccountsBackgroundService to perform cleanups yet.
- self.rc.accounts.purge_slot(self.slot());
+ self.rc.accounts.purge_slot(self.slot(), false);


Problem
To support the dumping logic in #17269, remove_unrooted_slot():
solana/runtime/src/accounts_db.rs
Line 3230 in a3c0833
To this end, it would be nice if remove_unrooted_slot() and other purge logic like purge_slots():
solana/runtime/src/accounts_db.rs
Lines 3209 to 3221 in a3c0833
(Bank::drop() for unrooted banks), could share the same core purging logic by both calling into purge_slots_from_cache_and_store():
solana/runtime/src/accounts_db.rs
Lines 3068 to 3072 in a3c0833
Currently, however, if the slot to be purged is not in the cache, purge_slots_from_cache_and_store() does not purge the AccountsIndex entries, only the storage entries themselves:
solana/runtime/src/accounts_db.rs
Lines 3098 to 3101 in a3c0833
Summary of Changes
In order to introduce accounts index purging into purge_slots_from_cache_and_store(slot) (see Problem 1 above), we would ideally like to reuse the purge_exact() -> handle_reclaims() flow (similar to the one in clean_accounts) to completely remove the slot.

However, the current problem is that handle_reclaims() itself calls into purge_slots_from_cache_and_store() to delete the storage entries, via the path handle_reclaims() -> process_dead_slots() -> purge_storage_slots() -> purge_slots_from_cache_and_store(), so we can't currently call handle_reclaims() inside purge_slots_from_cache_and_store().

To fix this, we first note that handle_reclaims() only ever reclaims entries not in the cache. This means that if we factor out the storage deletion logic from purge_slots_from_cache_and_store() into a separate purge_slot_storage_entries(), then we can break this cycle by having both purge_slots_from_cache_and_store() and handle_reclaims() -> purge_storage_slots() call into purge_slot_storage_entries(): https://github.com/solana-labs/solana/compare/master...carllin:RefactorPurge?expand=1#diff-1090394420d51617f3233275c2b65ed706b35b53b115fe65f82c682af8134a6fR3112

Now that purge_slots_from_cache_and_store() completely removes the accounts index entry for unrooted banks, we no longer need the logic in Bank::drop() to add the dirty keys: https://github.com/solana-labs/solana/compare/master...carllin:RefactorPurge?expand=1#diff-ed47b4a0198313377e091bb3957bbbc63d937805426d1b2b6de39d0a50d32a0cL5120-L5130

@brooksprumo
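The shape of the refactor can be sketched with a toy model. This is not the real AccountsDb: ToyDb and its fields are assumptions made for illustration, the method bodies are heavily simplified (every reclaimed slot is treated as dead), and only the call structure mirrors the description above. The point it tries to show is that purge_slot_storage_entries() is a leaf that never calls back into purge_slots_from_cache_and_store(), so handle_reclaims() can be called from the purge entry point without a cycle.

```rust
use std::collections::{HashMap, HashSet};

type Slot = u64;
type Pubkey = &'static str;

/// Toy model (assumption): which slots are cached, which slots each pubkey
/// has index entries for, and which slots have backing storage.
#[derive(Default)]
struct ToyDb {
    cache: HashSet<Slot>,
    index: HashMap<Pubkey, HashSet<Slot>>,
    storage: HashSet<Slot>,
}

impl ToyDb {
    /// Leaf step: only touches storage and never calls the purge entry point.
    fn purge_slot_storage_entries(&mut self, dead_slots: &HashSet<Slot>) {
        for slot in dead_slots {
            self.storage.remove(slot);
        }
    }

    /// Removes every index entry for `slot` and returns the reclaimed pairs.
    fn purge_exact(&mut self, slot: Slot) -> Vec<(Pubkey, Slot)> {
        let mut reclaims = Vec::new();
        for (key, slots) in self.index.iter_mut() {
            if slots.remove(&slot) {
                reclaims.push((*key, slot));
            }
        }
        self.index.retain(|_, slots| !slots.is_empty());
        reclaims
    }

    /// Simplification: every reclaimed slot is treated as dead.
    fn handle_reclaims(&mut self, reclaims: &[(Pubkey, Slot)]) {
        let dead_slots: HashSet<Slot> = reclaims.iter().map(|(_, slot)| *slot).collect();
        self.purge_slot_storage_entries(&dead_slots);
    }

    /// Entry point: index entries are always purged; storage is purged via
    /// handle_reclaims() only when the slot was not in the cache.
    fn purge_slots_from_cache_and_store(&mut self, slots: &[Slot]) {
        for &slot in slots {
            let reclaims = self.purge_exact(slot);
            if !self.cache.remove(&slot) {
                self.handle_reclaims(&reclaims);
            }
        }
    }
}
```

In this model, purging a flushed slot and a cached slot in one call leaves no index entries, no storage entries, and no cache entries behind for either slot.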
Fixes #