Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Discard pre hard fork persisted tower if hard-forking#13527

Closed
ryoqun wants to merge 4 commits intosolana-labs:masterfrom
ryoqun:sorry-my-persisted-tower-killed-tds
Closed

Discard pre hard fork persisted tower if hard-forking#13527
ryoqun wants to merge 4 commits intosolana-labs:masterfrom
ryoqun:sorry-my-persisted-tower-killed-tds

Conversation

@ryoqun
Copy link
Copy Markdown
Contributor

@ryoqun ryoqun commented Nov 11, 2020

Problem

The tds relaunch failed. The theory is like this:

Substantial validator (~20-30%, rough guess from chronograf) has rooted newer slots than the hard fork slot (4676591) and kept the ledger. In that case, validator cannot vote because restored tower isn't adjusted for the hard fork and it prevents voting at all because of switch threshold failure. Finally this makes liveness under supermajority.
voting node log sample:

[2020-11-11T08:02:10.893472532Z INFO  solana_core::validator] Waiting for 80% of activated stake at slot 46765918 to be in gossip...
[2020-11-11T08:02:10.781175975Z ERROR solana_core::consensus] For some reason, we're REPROCESSING slots which has already been voted and ROOTED by us; VOTING will be SUSPENDED UNTIL 46766871!
[2020-11-11T13:09:32.455506994Z INFO  solana_core::replay_stage] Waiting to switch vote to 46766495, resetting to slot None on same fork for now
[2020-11-11T13:09:33.192110752Z INFO  solana_core::replay_stage] Waiting to switch vote to 46766980, 
resetting to slot None on same fork for now
[2020-11-11T13:09:33.192139724Z INFO  solana_core::replay_stage] Couldn't vote on heaviest fork: 46766980, heaviest_fork_failures: [FailedSwitchThreshold(46766980)]
$ TZ=UTC ls -l ledger/tower-twP716sMjTFynrZRSCKkHhoeHaQEXnR2bzqDEzuxP81.bin 
-rw-r--r-- 1 sol sol 2360 Nov  9 04:17 ledger/tower-twP716sMjTFynrZRSCKkHhoeHaQEXnR2bzqDEzuxP81.bin # tower not saved after the restart at all.

Summary of Changes

Discard old persited tower if hard-forking.

  • Think more to ensure to avoid false positive/false negative...
  • Write a test.

Fixes #

@ryoqun ryoqun requested a review from carllin November 11, 2020 16:31
Comment thread core/src/validator.rs
if let Some(wait_slot_for_supermajority) = config.wait_for_supermajority {
if root_bank.slot() == wait_slot_for_supermajority { // <= is this check too strict?
// intentionally fail to restore tower; we're supposedly in a new hard fork; past
// out-of-chain vote state doesn't make sense at all
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. we need to check last_voted_slot?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably fine for now, but how about storing the shred version in the persistent vote file? I think that might be more robust, we can just discard it if the shred version has a mismatch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. makes sense.

@ryoqun ryoqun added the v1.4 label Nov 11, 2020
@ryoqun
Copy link
Copy Markdown
Contributor Author

ryoqun commented Nov 11, 2020

@carllin Could you review this? :)

@codecov
Copy link
Copy Markdown

codecov Bot commented Nov 11, 2020

Codecov Report

Merging #13527 (4f6e3b1) into master (58354d1) will decrease coverage by 0.0%.
The diff coverage is 12.5%.

@@            Coverage Diff            @@
##           master   #13527     +/-   ##
=========================================
- Coverage    82.1%    82.1%   -0.1%     
=========================================
  Files         378      378             
  Lines       90603    90628     +25     
=========================================
- Hits        74435    74423     -12     
- Misses      16168    16205     +37     

Comment thread core/src/validator.rs Outdated
// intentionally fail to restore tower; we're supposedly in a new hard fork; past
// out-of-chain vote state doesn't make sense at all
// what if --wait-for-supermajority again if the validator restarted?
warn!("bla bla");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well added proper message.

@carllin
Copy link
Copy Markdown
Contributor

carllin commented Nov 11, 2020

Should the persisted tower be deleted/overwritten with default in the hard fork case (since it's a hard fork we don't care about the votes anyways)? The tower would be overwritten on the first vote, but there's still a small window. I'm imagining an edge case where somebody starts up with wait_for_supermajority, then shuts down before the validator votes, then starts up again without wait_for_supermajority

@ryoqun ryoqun marked this pull request as ready for review November 11, 2020 20:26
Comment on lines +43 to +52
fn create_restart_context(
&mut self,
pubkey: &Pubkey,
cluster_validator_info: &mut ClusterValidatorInfo,
) -> (solana_core::cluster_info::Node, Option<ContactInfo>);
fn restart_node_with_context(
cluster_validator_info: ClusterValidatorInfo,
restart_context: (solana_core::cluster_info::Node, Option<ContactInfo>),
) -> ClusterValidatorInfo;
fn add_restarted_node(&mut self, pubkey: &Pubkey, cluster_validator_info: ClusterValidatorInfo);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some ugly api is needed to work around blocking Validaotor::new. ;)

sleep(Duration::from_millis(100));

if let Some(root) = root_in_tower(&val_a_ledger_path, &validator_a_pubkey) {
if root >= 15 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call this root min_root, and then the hard_fork_slot below can be min_root - 5 to indicate the purpose is to be less than min_root


#[test]
#[serial]
fn test_hard_fork() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was no test or whatever for hard fork code.. hence this generic test name can be justified.

Comment thread core/src/consensus.rs
// should have been filtered out, as they all have a descendant,
// namely the `last_vote` itself.
assert!(!last_vote_ancestors.contains(candidate_slot));
if !self.is_stray_last_vote() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Investigating why this is necessary, hopefully just a test issue 🙏

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryoqun, I commented this check out, but haven't been able to repro the failure you described. The test seems to be passing fine on several different machines.

Comment thread core/src/validator.rs
// intentionally fail to restore tower; we're supposedly in a new hard fork; past
// out-of-chain vote state doesn't make sense at all
// what if --wait-for-supermajority again if the validator restarted?
let message = format!("Hardfork is detected; discarding tower restoration result: {:?}", tower);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops! Debug of tower can be really big. it might not be wise to put in the datapoint_error!, only add error! below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: include wait_slot_for_supermajority instead.

@ryoqun
Copy link
Copy Markdown
Contributor Author

ryoqun commented Nov 12, 2020

superseded by #13536

@ryoqun ryoqun closed this Nov 12, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants