Avoid to miss to root for local slots before the hard fork#19912
Avoid to miss to root for local slots before the hard fork#19912ryoqun merged 7 commits intosolana-labs:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #19912 +/- ##
=========================================
- Coverage 82.1% 81.9% -0.3%
=========================================
Files 628 631 +3
Lines 171471 174118 +2647
=========================================
+ Hits 140878 142615 +1737
- Misses 30593 31503 +910 |
fd0abd5 to
f616d28
Compare
| reconcile_blockstore_roots_with_external_source( | ||
| ExternalRootSource::HardFork(hard_fork_restart_slot), | ||
| &blockstore, | ||
| &mut last_blockstore_root, | ||
| ) |
There was a problem hiding this comment.
so this is the meat of this pr
| "Reconciling slots as root based on external root: {:?} ({}..{}) ", | ||
| new_roots, external_root, last_blockstore_root | ||
| ); | ||
| blockstore.set_roots(new_roots.iter())?; |
There was a problem hiding this comment.
@carllin I wonder this .set_roots(...) should be accompanied with set_duplicate_confirmed_slots_and_hashes(...) like in blockstore_processor?:
solana/ledger/src/blockstore_processor.rs
Line 1169 in 5bbb0da
There was a problem hiding this comment.
@ryoqun ah yeah it should, good catch!
Although I think you'll see that there are no calls to fn set_duplicate_confirmed_slots_and_hashes in 1.6 and 1.7 currently 😄
There was a problem hiding this comment.
after some code reading, i had to make a kind of compromise regarding the treatment of duplicate confirmed slots...: https://github.com/solana-labs/solana/pull/19912/files#r902224585
so, i've defined small fn (mark_slots_as_if_rooted_normally_at_startup) for the sole sake of documentation to make the nuance very explicit, to my taste. i hope it'll be more immune to comment rots. ;)
db859c1 to
195f20a
Compare
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
|
@ryoqun , any chance of resuscitating this?? |
@CriesofCarrots thanks for caring this pr. yeah, i haven't forgetting this pr at all... so stressful that i can't get to this for this extended period of time.... I'm planning to work on this, this week really. |
350c018 to
cb4e410
Compare
0ef68ea to
c924ff8
Compare
|
status: I resuscitated this pr with all review so far being addressed. I'll write replies to review comments later. |
| // Unfortunately, we can't supply duplicate-confirmed hashes, | ||
| // because it can't be guaranteed to be able to replay these slots | ||
| // under this code-path's limited condition (i.e. those shreds | ||
| // might not be available, etc...) also correctly overcoming this | ||
| // limitation is hard... | ||
| blockstore.mark_slots_as_if_rooted_normally_at_startup( | ||
| new_roots.into_iter().map(|root| (root, None)).collect(), | ||
| false, | ||
| )?; |
There was a problem hiding this comment.
hmmm yeah this is yet another tricky edge case to reason about, but I think it should be ok.
Not marking these slots as duplicate confirmed just means:
- we can't serve requests to people asking for the correct version of these slots via
AncestorHashesRepairType - Should be safe as long as we're not freezing the slot again during replay (not happening b/c rooted)
| // blockstore.last_root() might have been updated already. | ||
| // so take a &mut param both to input (and output iff we update root) | ||
| last_blockstore_root: &mut Slot, |
| )), | ||
| &poh_recorder, | ||
| tower, | ||
| None, |
| info!("Tower state: {:?}", tower); | ||
| tower | ||
| } else { | ||
| warn!("creating default tower...."); |
| } | ||
| } | ||
|
|
||
| impl<'a> From<ProcessBlockStore<'a>> for Tower { |
There was a problem hiding this comment.
@mvines I don't think this is a correct use of From, which is introduced at #23852 (https://github.com/solana-labs/solana/pull/23852/files#r902237878).
This is converting quite different things. ideally, From should be a cast for conceptually similar things. Also, this made code reading a bit hard because i can't easily spot when ProcessBlockStore::process is exactly executed now. (i.e. searching the correct .into() is hard)
it seems this trait workaround is only needed for this test test_tvu_exit and i think alternative approach should arguably be tolerable (still not ideal, i skipped properly introducing enum or trait just for this purpose...): https://github.com/solana-labs/solana/pull/19912/files#r902243960
btw, i didn't know there is an identity/no-op blanket From/Into impl provided here
https://doc.rust-lang.org/src/core/convert/mod.rs.html#559:
// From (and thus Into) is reflexive
#[stable(feature = "rust1", since = "1.0.0")]
#[rustc_const_unstable(feature = "const_convert", issue = "88674")]
impl<T> const From<T> for T {
/// Returns the argument unchanged.
fn from(t: T) -> T {
t
}
}granted, this refactor isn't relevant to this pr but i couldn't resist the urge, when i was code-reading to guarantee the safety of the scary blockstore mutation with hard fork params.
happy to separate this change into another preparatory pr, if you desire. :)
There was a problem hiding this comment.
Sure, this looks fine to me
this is finally done! @carllin could you review this? cc: @CriesofCarrots |
|
sanity checked and got: seems working properly. |
|
seems this worked for 2023-02-26 mainnet-beta outage: |
| assert_eq!(&slots_a[slots_a.len() - roots_a.len()..].to_vec(), &roots_a); | ||
| assert_eq!(&slots_b[slots_b.len() - roots_b.len()..].to_vec(), &roots_b); |
| /// Dangerous. Currently only needed for a local-cluster test | ||
| pub fn unset_parent(&mut self) { | ||
| self.parent_slot = None; | ||
| } | ||
|
|
Problem
some rooted slots could be omitted to be marked.
some rooted slots could be omitted after restart (hard fork), if the hard fork slot is ahead of the validator's latest ledger root before the restart.
This is usually the case because current practice of cluster restart is based on the opt-conf-ed slot. (i.e. no node should have rooted that slot yet)
ultimately, this leads to hole in bigtable.
cc: @mvines @sakridge @carllin
Summary of Changes
fix it. (this is wip; I'm working on to fix the actual hole now).impl finishedrelated: #13536