Discard pre hard fork persisted tower if hard-forking #13527
ryoqun wants to merge 4 commits into solana-labs:master
if let Some(wait_slot_for_supermajority) = config.wait_for_supermajority {
    if root_bank.slot() == wait_slot_for_supermajority { // <= is this check too strict?
        // intentionally fail to restore tower; we're supposedly in a new hard fork; past
        // out-of-chain vote state doesn't make sense at all
Hmm, do we need to check last_voted_slot?
This is probably fine for now, but how about storing the shred version in the persistent vote file? I think that might be more robust, we can just discard it if the shred version has a mismatch.
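The shred-version suggestion above could look roughly like the sketch below. `SavedTower`, its fields, and `restore_if_same_fork` are hypothetical names made up for illustration, not the actual solana_core types, and the sketch assumes the shred version changes across a hard fork, as the comment implies.

```rust
/// Hypothetical on-disk form of a persisted tower, tagged with the shred
/// version that was active when the file was written.
#[derive(Debug, Clone, PartialEq)]
pub struct SavedTower {
    pub shred_version: u16,
    pub last_voted_slot: Option<u64>,
    pub root: u64,
}

/// Returns the saved tower only if it was written under the shred version
/// the validator is currently running with; otherwise the file is treated
/// as belonging to a pre-hard-fork chain and is discarded.
pub fn restore_if_same_fork(
    saved: SavedTower,
    current_shred_version: u16,
) -> Option<SavedTower> {
    if saved.shred_version == current_shred_version {
        Some(saved)
    } else {
        // Mismatch: the votes in this file refer to a different fork.
        None
    }
}
```

Compared with checking the root bank slot against the --wait-for-supermajority slot, a check like this would not depend on how the validator was started, only on what is stored in the file.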
@carllin Could you review this? :)
Codecov Report
@@           Coverage Diff            @@
##           master   #13527    +/-   ##
========================================
- Coverage    82.1%    82.1%   -0.1%
========================================
  Files         378      378
  Lines       90603    90628     +25
========================================
- Hits        74435    74423     -12
- Misses      16168    16205     +37
// intentionally fail to restore tower; we're supposedly in a new hard fork; past
// out-of-chain vote state doesn't make sense at all
// what if --wait-for-supermajority again if the validator restarted?
warn!("bla bla");
Well, I added a proper message.
Should the persisted tower be deleted/overwritten with default in the hard fork case (since it's a hard fork we don't care about the votes anyways)? The tower would be overwritten on the first vote, but there's still a small window. I'm imagining an edge case where somebody starts up with
fn create_restart_context(
    &mut self,
    pubkey: &Pubkey,
    cluster_validator_info: &mut ClusterValidatorInfo,
) -> (solana_core::cluster_info::Node, Option<ContactInfo>);
fn restart_node_with_context(
    cluster_validator_info: ClusterValidatorInfo,
    restart_context: (solana_core::cluster_info::Node, Option<ContactInfo>),
) -> ClusterValidatorInfo;
fn add_restarted_node(&mut self, pubkey: &Pubkey, cluster_validator_info: ClusterValidatorInfo);
Some ugly API is needed to work around the blocking Validator::new. ;)
sleep(Duration::from_millis(100));

if let Some(root) = root_in_tower(&val_a_ledger_path, &validator_a_pubkey) {
    if root >= 15 {
Let's call this root min_root, and then the hard_fork_slot below can be min_root - 5 to indicate the purpose is to be less than min_root
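A minimal sketch of the naming suggested above. The constants and `wait_for_min_root` are hypothetical, and the `read_root` closure stands in for the test's `root_in_tower` helper, whose real signature is not reproduced here.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Smallest root the validator must reach before the test proceeds.
pub const MIN_ROOT: u64 = 15;
/// Hard fork slot, deliberately chosen to be less than MIN_ROOT.
pub const HARD_FORK_SLOT: u64 = MIN_ROOT - 5;

/// Polls `read_root` until it reports a root of at least MIN_ROOT, or gives
/// up after `max_polls` attempts. Returns the root that satisfied the check.
pub fn wait_for_min_root(
    mut read_root: impl FnMut() -> Option<u64>,
    max_polls: usize,
    poll_interval: Duration,
) -> Option<u64> {
    for _ in 0..max_polls {
        if let Some(root) = read_root() {
            if root >= MIN_ROOT {
                return Some(root);
            }
        }
        sleep(poll_interval);
    }
    None
}
```

Deriving the hard fork slot from `MIN_ROOT` keeps the relationship between the two values visible if either one is changed later.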
#[test]
#[serial]
fn test_hard_fork() {
There was no test at all for the hard fork code, hence this generic test name can be justified.
// should have been filtered out, as they all have a descendant,
// namely the `last_vote` itself.
assert!(!last_vote_ancestors.contains(candidate_slot));
if !self.is_stray_last_vote() {
Investigating why this is necessary, hopefully just a test issue 🙏
@ryoqun, I commented this check out, but haven't been able to repro the failure you described. The test seems to be passing fine on several different machines.
// intentionally fail to restore tower; we're supposedly in a new hard fork; past
// out-of-chain vote state doesn't make sense at all
// what if --wait-for-supermajority again if the validator restarted?
let message = format!("Hardfork is detected; discarding tower restoration result: {:?}", tower);
Oops! The Debug output of the tower can be really big. It might not be wise to put it in the datapoint_error!; only add it to the error! below.
nit: include wait_slot_for_supermajority instead.
Superseded by #13536.
Problem
The tds relaunch failed. The working theory is as follows:
A substantial share of validators (~20-30%, a rough guess from chronograf) had rooted slots newer than the hard fork slot (4676591) and kept their ledgers. In that case, a validator cannot vote: the restored tower isn't adjusted for the hard fork, so every vote is blocked by a switch threshold failure. As a result, the voting stake stays below the supermajority and the cluster loses liveness.
voting node log sample:
Summary of Changes
Discard the old persisted tower if hard-forking.
Fixes #
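The decision described in the summary can be sketched as below. This is a simplified illustration, not the code in this PR: `Tower` here is a hypothetical two-field stand-in, and `keep_restored_tower` is an assumed helper name. It mirrors the reviewed check that compares the root bank slot with the --wait-for-supermajority slot.

```rust
/// Hypothetical, stripped-down stand-in for the validator's tower.
#[derive(Debug, Clone, PartialEq)]
pub struct Tower {
    pub last_voted_slot: Option<u64>,
    pub root: u64,
}

/// Decides whether a tower restored from disk should be kept.
///
/// If the validator is started with a wait-for-supermajority slot equal to
/// the root bank slot, it is assumed to be booting into a new hard fork, so
/// the restored (pre hard fork) tower is dropped by returning None.
pub fn keep_restored_tower(
    restored: Option<Tower>,
    root_bank_slot: u64,
    wait_slot_for_supermajority: Option<u64>,
) -> Option<Tower> {
    let hard_forking = wait_slot_for_supermajority == Some(root_bank_slot);
    if hard_forking {
        // Past out-of-chain vote state is meaningless on the new fork.
        None
    } else {
        restored
    }
}
```

With the numbers from the problem description, a tower rooted past slot 4676591 would be dropped when the validator restarts at that slot with the matching wait slot, and kept on an ordinary restart.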