This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Support out of band dumping of unrooted slots in AccountsDb#17269

Merged
mergify[bot] merged 8 commits into solana-labs:master from carllin:DumpAccounts
Jun 2, 2021

Contributor

@carllin carllin commented May 17, 2021

Problem

Follow-up to #17319.

Bad versions of duplicate blocks can be replayed, and thus need to be cleared from Accounts state before replaying the good version.

Summary of Changes

Support safe block dumping by:

  1. Making sure cache flushes for the same block do not happen concurrently, via the RemoveUnrootedSlots synchronization mechanism, which keeps a list of slots that are either being flushed from the cache or being removed by this new dumping path. The dumping path blocks until any in-progress flush completes, whereas the cache flushing path ignores requests to flush slots that are being dumped or are about to be.
  2. Removing the slot completely from both the AccountsIndex and the storage entries by calling purge_slots_from_cache_and_store().
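
The two rules above can be sketched roughly as follows, assuming a `Mutex<HashSet<Slot>>` paired with a `Condvar`. `SlotSync`, `try_flush`, and `remove_unrooted` are hypothetical names used only for illustration, not the PR's API, and the removal path is simplified: it waits until none of the requested slots is being flushed and then reserves them all at once.

```rust
use std::collections::HashSet;
use std::sync::{Condvar, Mutex};

type Slot = u64;

/// Hypothetical stand-in for the PR's synchronization state.
#[derive(Default)]
struct SlotSync {
    /// Slots currently being flushed from the cache or being removed.
    slots_under_contention: Mutex<HashSet<Slot>>,
    signal: Condvar,
}

impl SlotSync {
    /// Flush path: skips the flush and returns `false` if `slot` is already reserved.
    fn try_flush(&self, slot: Slot, do_flush: impl FnOnce()) -> bool {
        {
            let mut contended = self.slots_under_contention.lock().unwrap();
            if !contended.insert(slot) {
                return false;
            }
        } // lock released before the (potentially slow) flush
        do_flush();
        self.slots_under_contention.lock().unwrap().remove(&slot);
        self.signal.notify_all();
        true
    }

    /// Removal path: blocks until no flush holds any of `slots`, reserves them,
    /// runs `purge`, then releases the reservation.
    fn remove_unrooted(&self, slots: &[Slot], purge: impl FnOnce(&[Slot])) {
        let mut contended = self.slots_under_contention.lock().unwrap();
        while slots.iter().any(|slot| contended.contains(slot)) {
            // `wait` releases the lock while blocked and reacquires it on wakeup.
            contended = self.signal.wait(contended).unwrap();
        }
        contended.extend(slots.iter().copied());
        drop(contended);

        purge(slots);

        let mut contended = self.slots_under_contention.lock().unwrap();
        for slot in slots {
            contended.remove(slot);
        }
        drop(contended);
        self.signal.notify_all();
    }
}
```

In this sketch a flush request that arrives while a slot is reserved for removal is simply dropped, which mirrors the "ignore requests to flush" behavior described in item 1.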

Follow-up PR:
Notify relevant scans that their results may be outdated and abort those results, as noted here:

// TODO: This is currently unsafe with scan because it can remove a slot in the middle

Fixes #

@carllin carllin force-pushed the DumpAccounts branch 5 times, most recently from 9126345 to cf8ae73 Compare May 19, 2021 02:08
@carllin carllin force-pushed the DumpAccounts branch 2 times, most recently from c1b01f8 to 8ccb077 Compare May 22, 2021 03:08
codecov Bot commented May 22, 2021

Codecov Report

Merging #17269 (f421084) into master (a0d721c) will decrease coverage by 0.0%.
The diff coverage is 76.2%.

@@            Coverage Diff            @@
##           master   #17269     +/-   ##
=========================================
- Coverage    82.7%    82.7%   -0.1%     
=========================================
  Files         428      430      +2     
  Lines      119963   120568    +605     
=========================================
+ Hits        99324    99796    +472     
- Misses      20639    20772    +133     

@carllin carllin force-pushed the DumpAccounts branch 2 times, most recently from ef09ea7 to 8e29975 Compare May 24, 2021 21:28
@carllin carllin marked this pull request as ready for review May 24, 2021 21:36
@carllin carllin requested a review from lijunwangs May 27, 2021 23:07
// Reads will then always read the latest version of a slot. Scans will also know
// which version their parents because banks will also be augmented with this version,
// which handles cases where a deletion of one version happens in the middle of the scan.
// TODO: This is currently unsafe with scan because it can remove a slot in the middle
Contributor

@jeffwashington jeffwashington May 27, 2021

What is your plan for this now? Does this unsafeness exist today and you're just calling it out here or does this PR add this unsafeness? Is the unsafeness worth whatever this is fixing? I'm missing some context.
I'm confused with 'currently'. Prior to your change, in master? Or, 'currently' because of your change?

I should have read the comment above better before commenting ;-) It was unsafe. It is still unsafe. Nothing to see here!

Contributor Author

@carllin carllin May 27, 2021

Yeah, this unsafeness does exist today, but luckily this function isn't called anywhere by the validator 😃

Yeah it's definitely worth fixing, and is addressed in this next PR here: #17471, a peek into the future if you will 👀

Contributor

@brooksprumo brooksprumo left a comment

OK! I made it through!

I'm still not at a point where I understand all the uses and places when flushes/purges/cleans/removals happen, so I'm not sure if there are other designs that would also work here. But this design looks good to me.

I had one question about a variable copy, but that's it. I can give an Approve after that's resolved.

Also, nice test; pretty gnarly!

Comment thread runtime/src/accounts_db.rs
Comment thread runtime/src/accounts_db.rs Outdated
Contributor

@lijunwangs lijunwangs left a comment

Just some questions.

Comment thread runtime/src/accounts_db.rs Outdated
// For each slot the cache flush has finished, mark that we're about to start
// purging these slots by reserving it in `contended_slots`.
contended_cache_flush_slots.retain(|flushing_slot| {
let is_being_flushed = contended_slots.contains(flushing_slot);
Contributor

In the beginning of the loop, contended_cache_flush_slots contains the slots only found in contended_slots.
When it wakes up, how come the slot in contended_cache_flush_slots is not found in contended_slots? Who removes it? If the flush thread is removing it, then wouldn't we risk spinning on the same slots here when adding it back? Or am I missing something?

Contributor Author

@carllin carllin May 30, 2021

@lijunwangs,

When it wakes up, how come the slot in contended_cache_flush_slots is not found in contended_slots

The signal.wait(contended_slots).unwrap() above releases the contended_slots and then regrabs and returns the locked list when the condition variable gets a signal, see documentation here: https://doc.rust-lang.org/std/sync/struct.Condvar.html#method.wait. So in the meantime, the flush thread may have finished flushing and removed it from the list.

When it wakes up, how come the slot in contended_cache_flush_slots is not found in contended_slots

So the flush thread is the thread removing the slots from contended_slots, which is the shared variable under the lock.

The contended_cache_flush_slots is a local snapshot of the slots in contended_slots at the time we checked it at the beginning of the function, and is never added to again after we filter from it, so it's always shrinking after each iteration of the retain above. We only add a slot to the contended_slots through contended_slots.insert(*flushing_slot) after we've confirmed the retain will remove the slot because !is_being_flushed is true.
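
As a small illustration of the `Condvar::wait` behavior described above, here is a sketch under assumed names (`Shared`, `wait_until_released`, and `release_while_waiting` are hypothetical, not from the PR). The waiter holds the guard when it calls `wait`, yet the other thread can still lock the mutex and remove the slot, because `wait` releases the lock while blocked and reacquires it before returning.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

type Slot = u64;
type Shared = (Mutex<HashSet<Slot>>, Condvar);

/// Blocks until `slot` has left the shared set; returns how many times
/// the thread woke up from `wait`.
fn wait_until_released(shared: &Shared, slot: Slot) -> usize {
    let (lock, signal) = shared;
    let mut contended = lock.lock().unwrap();
    let mut wakeups = 0;
    while contended.contains(&slot) {
        // Atomically unlocks the mutex, sleeps, and relocks before returning.
        contended = signal.wait(contended).unwrap();
        wakeups += 1;
    }
    wakeups
}

/// Spawns a stand-in "flush" thread that removes slot 7 while the caller waits.
fn release_while_waiting() -> bool {
    let shared: Arc<Shared> = Arc::new((Mutex::new(HashSet::from([7])), Condvar::new()));
    let flusher = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            let (lock, signal) = &*shared;
            // Succeeds even if the waiter is parked in `wait`, since `wait`
            // gave up the lock.
            lock.lock().unwrap().remove(&7);
            signal.notify_all();
        })
    };
    wait_until_released(&shared, 7);
    flusher.join().unwrap();
    let is_empty = shared.0.lock().unwrap().is_empty();
    is_empty
}
```

The wakeup count is not asserted anywhere because it depends on scheduling: if the flush thread runs first, the waiter never blocks at all.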

Contributor

Thanks for explaining. Still have a question. In line 3445, contended_cache_flush_slots only has the slots which are also in contended_slots. Doesn't line 3471 re-add them?

Contributor Author

ha yeah it does, I don't think it's strictly necessary, but I wanted to uphold the invariant that any slot that is undergoing either a flush or remove_unrooted_slots is present in contended_slots, so we always have the working set of slots observable by any thread.

}
};
if !is_being_purged {
let flush_stats = self.do_flush_slot_cache(slot, &slot_cache, should_flush_f);
Contributor

This might be racy. In the above section the lock was taken on slots_under_contention, and then dropped. Can't the other thread running remove_unrooted_slots go in and actually purge it?

Contributor Author

@carllin carllin May 30, 2021

so we held the lock until we inserted here: remove_unrooted_slots.insert(slot);, then we dropped the lock.

The thread running the remove_unrooted_slots() function will block/be stuck looping on the condition variable until this flush thread removes the slot from the contended_slots list: https://github.com/solana-labs/solana/pull/17269/files#diff-1090394420d51617f3233275c2b65ed706b35b53b115fe65f82c682af8134a6fR3468
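
The ordering argument above can be checked with a small sketch. All names here (`flush_then_purge_order`, the event strings, slot `3`) are made up for illustration and the sleep only stands in for flush work. The flush thread reserves the slot under the lock and then drops the lock; the removing thread cannot get past its wait loop until the flush thread has removed the slot again.

```rust
use std::collections::HashSet;
use std::sync::{mpsc, Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

type Slot = u64;

/// Returns the order in which the flush and the purge were observed.
fn flush_then_purge_order() -> Vec<&'static str> {
    let state = Arc::new((Mutex::new(HashSet::<Slot>::new()), Condvar::new()));
    let log = Arc::new(Mutex::new(Vec::<&'static str>::new()));
    let (reserved_tx, reserved_rx) = mpsc::channel();

    let flusher = {
        let (state, log) = (Arc::clone(&state), Arc::clone(&log));
        thread::spawn(move || {
            let (lock, signal) = &*state;
            // Reserve the slot under the lock; the guard is dropped at the
            // end of this statement, before the flush work starts.
            lock.lock().unwrap().insert(3);
            reserved_tx.send(()).unwrap();
            thread::sleep(Duration::from_millis(50)); // stand-in for flush work
            log.lock().unwrap().push("flush done");
            lock.lock().unwrap().remove(&3);
            signal.notify_all();
        })
    };

    // Only start the removal once the flush has reserved slot 3.
    reserved_rx.recv().unwrap();
    {
        let (lock, signal) = &*state;
        let mut contended = lock.lock().unwrap();
        while contended.contains(&3) {
            contended = signal.wait(contended).unwrap();
        }
        contended.insert(3); // now reserved for the purge
        log.lock().unwrap().push("purge");
    }
    flusher.join().unwrap();
    let events = log.lock().unwrap().clone();
    events
}
```

Since the slot only leaves the set after "flush done" is logged, the purge entry can never come first in this sketch, even though the lock is not held during the flush itself.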

Contributor

Got it. Maybe rename let mut remove_unrooted_slots to let mut slots_under_contention to make it clearer.

Contributor Author

yup, will do!

lijunwangs previously approved these changes May 30, 2021
Contributor

@lijunwangs lijunwangs left a comment

Looks good to me. Thanks for answering my questions.

@brooksprumo brooksprumo self-requested a review May 31, 2021 18:40
brooksprumo previously approved these changes May 31, 2021
Contributor

@brooksprumo brooksprumo left a comment

Looks good! I think the doc comment updates would help future-Brooks too!

@@ -753,6 +753,13 @@ impl RecycleStores {
}
}

Contributor

@brooksprumo brooksprumo May 31, 2021

Maybe some doc comment like this?

Suggested change
/// Removing unrooted slots in Accounts Background Service needs to be synchronized with flushing slots from the Accounts Cache. This keeps track of those slots and the Mutex + Condvar for synchronization.

Contributor Author

Updated, thanks!

Comment thread runtime/src/accounts_db.rs Outdated
Comment on lines +842 to +844
// Set of slots currently being flushed by `flush_slot_cache()` or removed
// by `remove_unrooted_slot()`. Used to ensure `remove_unrooted_slots(slots)`
// can safely clear the set of unrooted slots `slots`.
Contributor

I love the comments. Could they be /// instead?

Contributor Author

done!

@mergify mergify Bot dismissed stale reviews from lijunwangs and brooksprumo June 1, 2021 04:49

Pull request has been modified.

jeffwashington previously approved these changes Jun 1, 2021
Contributor

@jeffwashington jeffwashington left a comment

lgtm

Comment thread runtime/src/accounts_db.rs Outdated
Comment on lines +3428 to +3433
if !rooted_slots.is_empty() {
panic!(
"Trying to remove accounts for rooted slots {:?}",
rooted_slots
);
}
Contributor

nits:

assert!(rooted_slots.is_empty(), "Trying to remove accounts for rooted slots {:?}", rooted_slots);
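
For reference, the suggested `assert!` form behaves like the `if`/`panic!` pair it replaces: it panics with the formatted message when the condition is false. A minimal sketch, where the wrapper function `check_no_rooted_slots` is a hypothetical name added only to make the snippet self-contained:

```rust
type Slot = u64;

/// Panics if any rooted slots were passed in for removal.
fn check_no_rooted_slots(rooted_slots: &[Slot]) {
    assert!(
        rooted_slots.is_empty(),
        "Trying to remove accounts for rooted slots {:?}",
        rooted_slots
    );
}
```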

Comment thread runtime/src/accounts_db.rs Outdated
// Mark that we're about to delete this slot now
contended_slots.insert(*flushing_slot);
}
!is_being_flushed
Contributor

@ryoqun ryoqun Jun 1, 2021

maybe, this condition needs to be flipped?

i mean, we need to retain() elements of contended_cache_flush_slots which have yet to be flushed (i.e. is_being_flushed) to re-evaluate its .is_empty()-ness later in this loop; that is, slots that are still contain()-ed in contended_slots (and will be remove()-ed by the flusher thread later)?

Contributor Author

Yeah, you're right, great catch.

It's bad the test didn't catch this, I'll have to fix that up as well to be more robust

Contributor Author

@carllin carllin Jun 2, 2021

So the test didn't catch this because the remove_unrooted_slots() thread will not wake up from waiting on the condition variable, until the flush thread has finished flushing and removed the slot from the contended slots, at which point the remove_unrooted_slots() thread will itself add the slot to the contended slots list, and then exit on the next iteration of the loop.

Updated the test to have an extra thread that simulates spurious wake up calls on the condition variable and the test now fails appropriately
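
A rough sketch of the corrected check and of a test in that spirit follows. It is not the PR's test: `wait_for_flushes`, `flushes_seen_with_noisy_condvar`, and the slot numbers are hypothetical, and the sleeps only stand in for flush work. The `retain` keeps the slots that are still being flushed, and an extra thread keeps notifying the condition variable so the wait loop wakes up before any flush has finished.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

type Slot = u64;
type Shared = (Mutex<HashSet<Slot>>, Condvar);

/// Returns once every slot in `remaining` has been released by its flush,
/// reserving each slot for purging as soon as it is free.
fn wait_for_flushes(shared: &Shared, mut remaining: Vec<Slot>) {
    let (lock, signal) = shared;
    let mut contended = lock.lock().unwrap();
    loop {
        remaining.retain(|slot| {
            let is_being_flushed = contended.contains(slot);
            if !is_being_flushed {
                // Flush finished: reserve the slot for purging.
                contended.insert(*slot);
            }
            // Keep only the slots that are still being flushed.
            is_being_flushed
        });
        if remaining.is_empty() {
            return;
        }
        // May wake up early; the loop re-checks the set every time.
        contended = signal.wait(contended).unwrap();
    }
}

/// Number of flushes that had completed when `wait_for_flushes` returned.
fn flushes_seen_with_noisy_condvar() -> usize {
    let shared: Arc<Shared> = Arc::new((Mutex::new(HashSet::from([1, 2])), Condvar::new()));
    let stop = Arc::new(AtomicBool::new(false));
    let flushed = Arc::new(AtomicUsize::new(0));

    // Simulates spurious wakeups by notifying without changing any state.
    let noisy = {
        let (shared, stop) = (Arc::clone(&shared), Arc::clone(&stop));
        thread::spawn(move || {
            while !stop.load(Ordering::SeqCst) {
                shared.1.notify_all();
                thread::sleep(Duration::from_millis(1));
            }
        })
    };
    // Stand-in flush thread: releases slots 1 and 2 one after the other.
    let flusher = {
        let (shared, flushed) = (Arc::clone(&shared), Arc::clone(&flushed));
        thread::spawn(move || {
            for slot in 1..=2u64 {
                thread::sleep(Duration::from_millis(20));
                let mut contended = shared.0.lock().unwrap();
                flushed.fetch_add(1, Ordering::SeqCst);
                contended.remove(&slot);
                drop(contended);
                shared.1.notify_all();
            }
        })
    };

    wait_for_flushes(&shared, vec![1, 2]);
    let seen = flushed.load(Ordering::SeqCst);
    stop.store(true, Ordering::SeqCst);
    noisy.join().unwrap();
    flusher.join().unwrap();
    seen
}
```

With the flipped condition (`!is_being_flushed` as the `retain` result), `remaining` would be emptied on the first pass and the function would return while both flushes were still in progress, so the count would typically come back as 0 rather than 2.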

Contributor

perfect. :)

Comment thread runtime/src/accounts_db.rs Outdated
Comment on lines +3449 to +3454
let is_cache_flushing_slot = contended_slots.contains(remove_slot);
if !is_cache_flushing_slot {
// Reserve the slots that we want to purge that aren't currently
// being flushed to prevent cache from flushing those slots in
// the future
contended_slots.insert(**remove_slot);
Contributor

this exclusive control-flow logic assumes the same (detected as duplicate) remove_slot won't ever be given to remove_unrooted_slots() from multiple replayingstage threads. Is this a correct assumption? Maybe replayingstage always processes one slot at a time? How about commenting about the underlying assumption somewhere?

Contributor Author

Yeah, there is only one replay thread, and one version of the remove_slot needs to be removed before the next version can be replayed. Added a comment!

Comment thread runtime/src/accounts_db.rs Outdated

// For each slot the cache flush has finished, mark that we're about to start
// purging these slots by reserving it in `contended_slots`.
contended_cache_flush_slots.retain(|flushing_slot| {
Contributor

nit: how about this naming?:

remaining_contended_flush_slots.retain(|flush_slot|
  • remaining_ is added to better describe this variable is retained() iteratively until .is_empty().
  • flush_slot is used to be consistent with remove_slot for english grammar.

Comment thread runtime/src/accounts_db.rs Outdated
let mut contended_cache_flush_slots: Vec<Slot> = remove_slots
.iter()
.filter(|remove_slot| {
let is_cache_flushing_slot = contended_slots.contains(remove_slot);
Contributor

@ryoqun ryoqun Jun 1, 2021

nits: rename to is_being_flushed to be consistent with https://github.com/solana-labs/solana/pull/17269/files#r641874795 (let is_being_flushed = contended_slots.contains(flushing_slot);) and better contrast with is_being_purged, too :)

imo, semantically same expression should result in same name unless there is good reason (like state transition between the occurrences). In this case, I think its usage is same for both occurrences in that it's used to reserve slot with contended_slots for removal.

Comment thread runtime/src/accounts_db.rs Outdated

{
// Slots that are currently being flushed by flush_slot_cache()
let mut contended_slots = slots_under_contention.lock().unwrap();
Contributor

nits: rename to currently_contended_slots to indicate volatility or mutation of this variable via Condvar?

struct RemoveUnrootedSlotsSynchronization {
// slots being flushed from the cache or being purged
slots_under_contention: Mutex<HashSet<Slot>>,
signal: Condvar,
Contributor

fyi, our first usage of Condvar in the codebase. :)

Contributor

I know. I keep running into situations where this has almost been the piece I needed. Glad to know it is there.

@mergify mergify Bot dismissed stale reviews from brooksprumo and jeffwashington June 1, 2021 22:39

Pull request has been modified.

Comment on lines +3454 to +3472
// Note that the single replay thread has to remove a specific slot `N`
// before another version of the same slot can be replayed. This means
// multiple threads should not call `remove_unrooted_slots()` simultaneously
// with the same slot.
Contributor

💯

ryoqun previously approved these changes Jun 2, 2021
Contributor

@ryoqun ryoqun left a comment

lgtm; nice solid work!

I like the logically chunked series of prs towards the duplicate slot implementation. :)

@mergify mergify Bot dismissed ryoqun’s stale review June 2, 2021 08:18

Pull request has been modified.

Contributor Author

carllin commented Jun 2, 2021

lgtm; nice solid work!

I like the logically chunked series of prs towards the duplicate slot implementation. :)

Thanks for the review!

@carllin carllin added the automerge Merge this Pull Request automatically once CI passes label Jun 2, 2021
@mergify mergify Bot merged commit bbcdf07 into solana-labs:master Jun 2, 2021
@carllin carllin added the v1.7 label Jun 7, 2021
mergify Bot pushed a commit that referenced this pull request Jun 7, 2021
* Accounts dumping logic

* Add test for interaction between cache flush and remove_unrooted_slot()

* Update comments

* Rename

* renaming

* Add more comments

* Renaming

* Fixup test and bad check

(cherry picked from commit bbcdf07)

# Conflicts:
#	runtime/src/accounts_db.rs
carllin added a commit that referenced this pull request Jun 7, 2021
mergify Bot added a commit that referenced this pull request Jun 7, 2021
…17777)

Co-authored-by: carllin <carl@solana.com>
@brooksprumo brooksprumo mentioned this pull request Aug 23, 2021