Save tower state persistently #7436
TristanDebrunner wants to merge 14 commits into solana-labs:master
Conversation
Force-pushed from f6230a0 to 604a911
Codecov Report
@@            Coverage Diff            @@
##           master    #7436    +/-   ##
========================================
+ Coverage    80.5%    80.5%    +<.1%
========================================
  Files         253      253
  Lines       55412    55636     +224
========================================
+ Hits        44637    44839     +202
- Misses      10775    10797      +22
sagar-solana left a comment
Looks ok to me. @carllin do we need to update the working bank based on the last-voted-on bank from tower, or will all of that just automatically resolve itself?
I discussed with @carllin and @CriesofCarrots, and the conclusion was that it was fine as is.
Force-pushed from 604a911 to a449297
Pull request has been modified.
@sagar-solana the last-voted-on bank in tower might not exist after loading from a snapshot. I think it should be fine; we can't do anything slashable with a working bank we haven't voted on.
| Tower::new(&my_pubkey, &vote_account, &bank_forks.read().unwrap()) |
Tower::new() will try to initialize the heaviest bank in bank_forks as root. I'm not sure this is the correct thing to do if we can't find the tower state or the tower state is corrupted.
Thinking we should panic and ask for manual intervention (probably use the private key to transfer funds out into a different account to avoid slashing).
@aeyakovenko, what do you think?
Snapshot recovery basically needs to send a tx, assert we are on the current fork, then expire the max censorship threshold.
What is the max censorship threshold?
@aeyakovenko can you elaborate on what you mean by "expire max censorship threshold"?
@TristanDebrunner wait for 2^threshold slots to start voting. This is necessary if the VoteState is reused between reboots. If it's a new VoteState, there is no chance of accidentally voting on a fork.
I assume you're referring to the threshold introduced in #6744?
It's part of the current implementation already.
(see line 19 in e98132f)
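For illustration, a minimal sketch of the back-off @aeyakovenko describes; `threshold` is assumed to be the switch threshold depth from #6744, and none of these names come from the PR:

```rust
// Hedged sketch, not code from this PR: when a VoteState is reused
// across a reboot, hold off voting for 2^threshold slots.
fn first_safe_vote_slot(restart_slot: u64, threshold: u32) -> u64 {
    restart_slot.saturating_add(1u64 << threshold)
}

fn main() {
    // e.g. a threshold depth of 8 means waiting 256 slots after a
    // restart at slot 1000
    assert_eq!(first_safe_vote_slot(1_000, 8), 1_256);
}
```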
How about changing this to return a Result<Tower>? Then if it returns Err, replay stage can check whether the validator's vote account exists in the bank. If it does exist and isn't empty, we panic as @carllin suggested. Otherwise, it's a new validator, so we can use the default that currently gets returned silently.
Then once the relevant checks in #6936 are implemented, replay stage can wait instead of panicking.
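For illustration, a hedged sketch of that flow; `Tower`, `TowerError`, and `load_or_default_tower` are hypothetical stand-ins, with the vote-account check reduced to a boolean rather than the real bank lookup:

```rust
// Toy stand-ins for the sketch; the real types live in the validator.
#[derive(Default)]
struct Tower;
#[derive(Debug)]
struct TowerError(String);

fn load_or_default_tower(
    restored: Result<Tower, TowerError>,
    vote_account_has_votes: bool,
) -> Tower {
    match restored {
        Ok(tower) => tower,
        Err(err) if vote_account_has_votes => {
            // Existing validator with missing/corrupt tower state:
            // voting now could be slashable, so demand manual intervention.
            panic!("tower state unrecoverable: {:?}", err)
        }
        // Fresh validator: the default tower is safe to start from.
        Err(_) => Tower::default(),
    }
}
```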
Should we also double check the signature to make sure the state wasn't corrupted?
@TristanDebrunner - is @carllin's question resolved? I see you asked for review again 7 days ago
Force-pushed from 802d710 to b3141f1
Force-pushed from f373275 to 74cea8d
.get_account(&vote_account)
{
    if let Some(vote_state) = VoteState::from(&account) {
        if !vote_state.votes.is_empty() {
Hmmm, just because the vote state is empty is no guarantee that this validator hasn't already voted, right? This validator could have voted and the votes haven't landed in a bank yet, which opens the door for slashing if:
- The tower state is corrupt (which is true by the time this logic executes)
- The votes land later
@carllin the validator can wait MAX_RECENT_BLOCKHASHES before it starts voting to alleviate this case, correct?
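For illustration, a hedged sketch of that wait; the constant's value is an assumption (MAX_RECENT_BLOCKHASHES was 300 in the SDK around this time, but verify before relying on it):

```rust
// A signed vote can only land while its recent blockhash is still
// valid, so after a restart with no recoverable tower the validator
// could hold off voting until that window has passed.
const MAX_RECENT_BLOCKHASHES: u64 = 300; // assumed value, check the SDK

fn in_flight_votes_expired(restart_slot: u64, current_slot: u64) -> bool {
    current_slot >= restart_slot.saturating_add(MAX_RECENT_BLOCKHASHES)
}
```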
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale pull request has been automatically closed. Thank you for your contributions.
@t-nelson - I'm going to take over this PR
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Problem
The Tower state of a validator is lost if the validator stops/crashes. This could lead to the validator voting in a slashable way upon reboot.

Summary of Changes
Have replay stage save its Tower state when generating a new vote, before sending the vote to the cluster. Also have replay stage try to load the saved state when it starts.

Thanks to @sagar-solana for the original work on this.
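For illustration, a hedged sketch of the save-before-send ordering described above; `Tower`, `record_vote`, `save`, and `send_vote` are toy stand-ins, not this PR's API:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Toy stand-ins; the real PR works with consensus::Tower and a signed
// vote transaction.
struct Tower {
    last_vote_slot: u64,
}

impl Tower {
    fn record_vote(&mut self, slot: u64) {
        self.last_vote_slot = slot;
    }
    fn save(&self, path: &Path) -> io::Result<()> {
        // The real implementation would serialize (and likely sign) the
        // full tower; a slot number is enough to show the ordering.
        fs::write(path, self.last_vote_slot.to_le_bytes())
    }
}

fn send_vote(slot: u64) {
    println!("broadcasting vote for slot {}", slot);
}

// The ordering the summary calls out: persist the tower *before* the
// vote leaves the validator, so a crash between the two steps can never
// leave the on-disk tower behind votes the cluster may have seen.
fn vote_and_persist(tower: &mut Tower, slot: u64, path: &Path) -> io::Result<()> {
    tower.record_vote(slot);
    tower.save(path)?;
    send_vote(slot);
    Ok(())
}
```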
Context
towards: #6936