Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Save tower state persistently#7436

Closed
TristanDebrunner wants to merge 14 commits intosolana-labs:masterfrom
TristanDebrunner:revive-saving-locktower
Closed

Save tower state persistently#7436
TristanDebrunner wants to merge 14 commits intosolana-labs:masterfrom
TristanDebrunner:revive-saving-locktower

Conversation

@TristanDebrunner
Copy link
Copy Markdown
Contributor

@TristanDebrunner TristanDebrunner commented Dec 11, 2019

Problem

The Tower state of a validator is lost if the validator stops/crashes. This could lead to the validator voting in a slashable way upon reboot.

Summary of Changes

Have replay stage save its Tower state when generating a new vote, before sending the vote to the cluster. Also have replay stage try to load the saved state when it starts.

Thanks to @sagar-solana for the original work on this.

Context

towards: #6936

@TristanDebrunner TristanDebrunner force-pushed the revive-saving-locktower branch 2 times, most recently from f6230a0 to 604a911 Compare December 11, 2019 22:57
@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 11, 2019

Codecov Report

Merging #7436 into master will increase coverage by <.1%.
The diff coverage is 89.3%.

@@           Coverage Diff            @@
##           master   #7436     +/-   ##
========================================
+ Coverage    80.5%   80.5%   +<.1%     
========================================
  Files         253     253             
  Lines       55412   55636    +224     
========================================
+ Hits        44637   44839    +202     
- Misses      10775   10797     +22

Comment thread core/src/replay_stage.rs Outdated
sagar-solana
sagar-solana previously approved these changes Dec 12, 2019
Copy link
Copy Markdown
Contributor

@sagar-solana sagar-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me. @carllin do we need to update working bank based on last voted on bank from tower or will all of that just automatically resolve itself?

@TristanDebrunner
Copy link
Copy Markdown
Contributor Author

I discussed with @carllin and @CriesofCarrots, and the conclusion was that it was fine as is.

@mergify mergify Bot dismissed sagar-solana’s stale review December 12, 2019 20:27

Pull request has been modified.

@carllin
Copy link
Copy Markdown
Contributor

carllin commented Dec 12, 2019

@sagar-solana the last voted on bank in tower might not exist after loading from a snapshot. I think iit should be fine, we can't do anything slashable with a working bank we haven't voted on.

Comment thread core/src/replay_stage.rs Outdated
Comment thread core/src/replay_stage.rs Outdated
Comment thread core/src/replay_stage.rs Outdated
Comment thread core/src/replay_stage.rs Outdated
e
);
}
Tower::new(&my_pubkey, &vote_account, &bank_forks.read().unwrap())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tower::new() will try to initialize the heaviest bank in bank_forks as root. I'm not sure this is the correct thing to do if we can't find tower state/tower state is corrupted.

Thinking we should panic, ask for manual intervention (probably use private key to transfer funds out into a different account to avoid slashing).

@aeyakovenko, what do you think?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snapshot recovery, basically needs to send a tx, assert we are on the current fork, then expire max censorship threshold

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is max censorship threshold?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aeyakovenko can you elaborate on what you mean by "expire max censorship threshold"?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TristanDebrunner wait for 2^threshold slots to start voting. This is necessary if the VoteState is reused between reboots. If it’s a new VoteState there is no chance of accidentally voting on a fork.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you're referring to the threshold introduced in #6744?

Copy link
Copy Markdown
Member

@aeyakovenko aeyakovenko Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's part of the current implementation already.

pub const VOTE_THRESHOLD_DEPTH: usize = 8;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about changing this to return a Result<Tower>? Then if it returns Err, replay stage can check if the validator's vote account exists in the bank. If it does exist and isn't empty, we panic as @carllin suggested. Otherwise, it's a new validator, so we can use the default that currently gets returned silently.

Then once the relevant checks in #6936 are implemented, replay stage can wait instead of panicing

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also double check the signature to make sure the state wasn't corrupted?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TristanDebrunner - is @carllin's question resolved? I see you asked for review again 7 days ago

Comment thread core/src/replay_stage.rs Outdated
Comment thread core/src/replay_stage.rs Outdated
@TristanDebrunner TristanDebrunner force-pushed the revive-saving-locktower branch 2 times, most recently from 802d710 to b3141f1 Compare December 13, 2019 01:57
@TristanDebrunner TristanDebrunner force-pushed the revive-saving-locktower branch 4 times, most recently from f373275 to 74cea8d Compare December 26, 2019 18:43
@mvines mvines added this to the Supertubes v0.22.3 milestone Jan 6, 2020
sakridge
sakridge previously approved these changes Jan 6, 2020
Copy link
Copy Markdown
Contributor

@sakridge sakridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mvines mvines added the v0.22 label Jan 6, 2020
Comment thread core/src/replay_stage.rs Outdated
.get_account(&vote_account)
{
if let Some(vote_state) = VoteState::from(&account) {
if !vote_state.votes.is_empty() {
Copy link
Copy Markdown
Contributor

@carllin carllin Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, just because the vote state is empty is no guarantee that this validator hasn't already voted right? This validator could have voted and the votes didn't land in a bank yet, which opens the door for slashing if:

  1. The tower state is corrupt (which is true by the time this logic executes)
  2. The votes land later

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carllin the validator can wait MAX_RECENT_BLOCKHASHES before it starts voting to alleviate this case, correct?

@stale stale Bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 17, 2020
@t-nelson t-nelson removed the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 17, 2020
@stale
Copy link
Copy Markdown

stale Bot commented Mar 24, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale Bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 24, 2020
@t-nelson t-nelson removed the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 24, 2020
@mvines mvines modified the milestones: v1.1.0, v1.2.0 Mar 30, 2020
@stale
Copy link
Copy Markdown

stale Bot commented Apr 6, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale Bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 6, 2020
@t-nelson t-nelson removed the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 6, 2020
@mvines
Copy link
Copy Markdown
Contributor

mvines commented Apr 6, 2020

(so sad! We'll land this one day. @t-nelson since you're super busy making the cli and ledger awesome, I imagine we'll find somebody else to take this off your hands one day hopefully soon. Maybe @ryoqun :) )

@stale
Copy link
Copy Markdown

stale Bot commented Apr 13, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale Bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 13, 2020
@stale
Copy link
Copy Markdown

stale Bot commented Apr 22, 2020

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale Bot closed this Apr 22, 2020
@mvines mvines reopened this Apr 22, 2020
@stale stale Bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 22, 2020
@mvines mvines assigned mvines and unassigned t-nelson Apr 27, 2020
@mvines
Copy link
Copy Markdown
Contributor

mvines commented Apr 27, 2020

@t-nelson - I'm going to take over this PR

@stale
Copy link
Copy Markdown

stale Bot commented May 4, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale Bot added the stale [bot only] Added to stale content; results in auto-close after a week. label May 4, 2020
@mvines mvines mentioned this pull request May 6, 2020
1 task
@mvines
Copy link
Copy Markdown
Contributor

mvines commented May 7, 2020

#9902

@mvines mvines closed this May 7, 2020
@ryoqun ryoqun mentioned this pull request Jun 19, 2020
1 task
@mvines mvines modified the milestones: v1.2.0, v1.3.0 Aug 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

stale [bot only] Added to stale content; results in auto-close after a week.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants