Save tower state persistently #7436
TristanDebrunner wants to merge 14 commits into solana-labs:master
Conversation
Force-pushed from f6230a0 to 604a911
Codecov Report
@@            Coverage Diff            @@
##           master    #7436    +/-   ##
========================================
+ Coverage    80.5%    80.5%    +<.1%
========================================
  Files         253      253
  Lines       55412    55636     +224
========================================
+ Hits        44637    44839     +202
- Misses      10775    10797      +22
sagar-solana left a comment
Looks ok to me. @carllin do we need to update the working bank based on the last-voted-on bank from tower, or will all of that just automatically resolve itself?
I discussed with @carllin and @CriesofCarrots, and the conclusion was that it was fine as is.
Force-pushed from 604a911 to a449297
Pull request has been modified.
@sagar-solana the last-voted-on bank in tower might not exist after loading from a snapshot. I think it should be fine; we can't do anything slashable with a working bank we haven't voted on.
| Tower::new(&my_pubkey, &vote_account, &bank_forks.read().unwrap()) |
Tower::new() will try to initialize the heaviest bank in bank_forks as root. I'm not sure this is the correct thing to do if we can't find the tower state or the tower state is corrupted.
Thinking we should panic and ask for manual intervention (probably use the private key to transfer funds out into a different account to avoid slashing).
@aeyakovenko, what do you think?
Snapshot recovery basically needs to send a tx, assert we are on the current fork, then expire the max censorship threshold.
What is the max censorship threshold?
@aeyakovenko can you elaborate on what you mean by "expire max censorship threshold"?
@TristanDebrunner wait for 2^threshold slots to start voting. This is necessary if the VoteState is reused between reboots. If it's a new VoteState, there is no chance of accidentally voting on a fork.
I assume you're referring to the threshold introduced in #6744?
It's part of the current implementation already.
(see line 19 in e98132f)
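For illustration, a minimal sketch of the back-off @aeyakovenko describes; `threshold` is assumed to be the switch threshold depth from #6744, and none of these names come from the PR:

```rust
// Hedged sketch, not code from this PR: when a VoteState is reused
// across a reboot, hold off voting for 2^threshold slots.
fn first_safe_vote_slot(restart_slot: u64, threshold: u32) -> u64 {
    restart_slot.saturating_add(1u64 << threshold)
}

fn main() {
    // e.g. a threshold depth of 8 means waiting 256 slots after a
    // restart at slot 1000
    assert_eq!(first_safe_vote_slot(1_000, 8), 1_256);
}
```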
How about changing this to return a Result<Tower>? Then if it returns Err, replay stage can check whether the validator's vote account exists in the bank. If it does exist and isn't empty, we panic as @carllin suggested. Otherwise, it's a new validator, so we can use the default that currently gets returned silently.
Then once the relevant checks in #6936 are implemented, replay stage can wait instead of panicking.
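For illustration, a hedged sketch of that flow; `Tower`, `TowerError`, and `load_or_default_tower` are hypothetical stand-ins, with the vote-account check reduced to a boolean rather than the real bank lookup:

```rust
// Toy stand-ins for the sketch; the real types live in the validator.
#[derive(Default)]
struct Tower;
#[derive(Debug)]
struct TowerError(String);

fn load_or_default_tower(
    restored: Result<Tower, TowerError>,
    vote_account_has_votes: bool,
) -> Tower {
    match restored {
        Ok(tower) => tower,
        Err(err) if vote_account_has_votes => {
            // Existing validator with missing/corrupt tower state:
            // voting now could be slashable, so demand manual intervention.
            panic!("tower state unrecoverable: {:?}", err)
        }
        // Fresh validator: the default tower is safe to start from.
        Err(_) => Tower::default(),
    }
}
```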
Should we also double check the signature to make sure the state wasn't corrupted?
@TristanDebrunner - is @carllin's question resolved? I see you asked for review again 7 days ago
Force-pushed from 802d710 to b3141f1
Force-pushed from f373275 to 74cea8d
.get_account(&vote_account)
{
    if let Some(vote_state) = VoteState::from(&account) {
        if !vote_state.votes.is_empty() {
Hmmm, just because the vote state is empty is no guarantee that this validator hasn't already voted, right? This validator could have voted and the votes haven't landed in a bank yet, which opens the door for slashing if:
- The tower state is corrupt (which is true by the time this logic executes)
- The votes land later
@carllin the validator can wait MAX_RECENT_BLOCKHASHES before it starts voting to alleviate this case, correct?
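For illustration, a hedged sketch of that wait; the constant's value is an assumption (MAX_RECENT_BLOCKHASHES was 300 in the SDK around this time, but verify before relying on it):

```rust
// A signed vote can only land while its recent blockhash is still
// valid, so after a restart with no recoverable tower the validator
// could hold off voting until that window has passed.
const MAX_RECENT_BLOCKHASHES: u64 = 300; // assumed value, check the SDK

fn in_flight_votes_expired(restart_slot: u64, current_slot: u64) -> bool {
    current_slot >= restart_slot.saturating_add(MAX_RECENT_BLOCKHASHES)
}
```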
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale pull request has been automatically closed. Thank you for your contributions.
@t-nelson - I'm going to take over this PR
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Problem
The Tower state of a validator is lost if the validator stops/crashes. This could lead to the validator voting in a slashable way upon reboot.

Summary of Changes
Have replay stage save its Tower state when generating a new vote, before sending the vote to the cluster. Also have replay stage try to load the saved state when it starts.

Thanks to @sagar-solana for the original work on this.
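For illustration, a hedged sketch of the save-before-send ordering described above; `Tower`, `record_vote`, `save`, and `send_vote` are toy stand-ins, not this PR's API:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Toy stand-ins; the real PR works with consensus::Tower and a signed
// vote transaction.
struct Tower {
    last_vote_slot: u64,
}

impl Tower {
    fn record_vote(&mut self, slot: u64) {
        self.last_vote_slot = slot;
    }
    fn save(&self, path: &Path) -> io::Result<()> {
        // The real implementation would serialize (and likely sign) the
        // full tower; a slot number is enough to show the ordering.
        fs::write(path, self.last_vote_slot.to_le_bytes())
    }
}

fn send_vote(slot: u64) {
    println!("broadcasting vote for slot {}", slot);
}

// The ordering the summary calls out: persist the tower *before* the
// vote leaves the validator, so a crash between the two steps can never
// leave the on-disk tower behind votes the cluster may have seen.
fn vote_and_persist(tower: &mut Tower, slot: u64, path: &Path) -> io::Result<()> {
    tower.record_vote(slot);
    tower.save(path)?;
    send_vote(slot);
    Ok(())
}
```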
Context
towards: #6936