Save/restore Tower #9902
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##           master    #9902    +/-  ##
========================================
  Coverage    81.5%    81.5%
========================================
  Files         288      280      -8
  Lines       66475    65919    -556
========================================
- Hits        54198    53775    -423
+ Misses      12277    12144    -133
```
I think this is ready for review
t-nelson
left a comment
LGTM! Thanks for dragging this one over the line 🙏
```rust
error!("Tower restore failed: {:?}", err);
process::exit(1);
}
info!("Rebuilding tower from the latest vote account");
```
@t-nelson, do you remember what the original design was for the case where tower restoration failed? The danger with restoring from the latest vote account is that it could be missing some votes that have been submitted onto the network, but have not landed in any bank.
I think we said we should either signal to the user to make a new vote account and eat the warmup/cooldown for staking, or have them accept the risk of potentially submitting conflicting votes.
Ultimately I think it comes down to what the slashing design is (which we don't have finalized, so it makes it hard to reason about this). For instance, if we decide slashing can only occur for votes that land in a bank, that all votes before the root are not slashable (not sure if this is a reasonable assumption), and you have a reasonable guess for the range of your last vote, then maybe you can just wait for a root to be made that is sufficiently far past your estimated last vote... just food for thought
Hmm... Yeah, there should definitely be user intervention here (at least by the time slashing is enabled). It is behind a CLI flag; maybe it'd be sufficient to reverse its default behavior? That is, require the tower by default.
agreed, there should be some level of intervention/notification here
```rust
    self.lockouts.root_slot
}

pub fn last_lockout_vote_slot(&self) -> Option<Slot> {
```
will this always just return the last item in self.lockouts.votes since the votes are always ordered smallest to largest?
Yeah but the vote ordering doesn't matter since this implementation does a max_by()
oh I meant it seems like the implementation can be simplified to self.lockouts.votes.first().map(|v| v.slot)
Ah, right. I was scared to make that sneaky assumption. Feels like it could silently break on me in the future. Should I just find some courage? 🦁
hehe, true, but we could write a test that asserts this finds the largest!
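A test like the one suggested could pin the invariant down. Here is a minimal standalone sketch (the `Lockout` type and function names are simplified stand-ins, not the real `solana_vote_program` ones, and it assumes lockouts double with confirmation count): because older votes carry deeper confirmation counts, the vote with the largest lockout expiration is always the first entry, so `max_by_key` and `first()` must agree.

```rust
// Simplified stand-in for the real vote lockout type (hypothetical).
struct Lockout {
    slot: u64,
    confirmation_count: u32,
}

impl Lockout {
    // Assumes lockout = 2^confirmation_count, so the lockout on this vote
    // expires at slot + 2^confirmation_count.
    fn expiration_slot(&self) -> u64 {
        self.slot + 2u64.pow(self.confirmation_count)
    }
}

// The current implementation: scan for the vote with the largest lockout.
fn last_lockout_vote_slot_max(votes: &[Lockout]) -> Option<u64> {
    votes.iter().max_by_key(|v| v.expiration_slot()).map(|v| v.slot)
}

// The proposed simplification: rely on the ordering invariant and take the
// first entry (oldest vote, deepest confirmation count, largest lockout).
fn last_lockout_vote_slot_first(votes: &[Lockout]) -> Option<u64> {
    votes.first().map(|v| v.slot)
}

fn main() {
    let votes = vec![
        Lockout { slot: 1, confirmation_count: 3 }, // expires at 1 + 8 = 9
        Lockout { slot: 3, confirmation_count: 2 }, // expires at 3 + 4 = 7
        Lockout { slot: 4, confirmation_count: 1 }, // expires at 4 + 2 = 6
    ];
    // Both implementations agree while the invariant holds.
    assert_eq!(last_lockout_vote_slot_max(&votes), Some(1));
    assert_eq!(
        last_lockout_vote_slot_max(&votes),
        last_lockout_vote_slot_first(&votes)
    );
    println!("{:?}", last_lockout_vote_slot_first(&votes)); // Some(1)
}
```

A unit test asserting the two implementations agree on towers that satisfy the ordering invariant would make the "sneaky assumption" explicit and fail loudly if it ever breaks.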
```rust
pub fn adjust_lockouts_if_newer_root(&mut self, root_slot: Slot) {
    let my_root_slot = self.lockouts.root_slot.unwrap_or(0);
    if root_slot > my_root_slot {
```
@t-nelson what happens in this case if there are votes in the tower that are still locked out past the snapshot root?
Aka you have a fork structure that looks like:

```
  0
 / \
2   3
```

Most recently you voted for 2 in your tower, but your snapshot root is 3, so you shouldn't vote until at least slot 4.
I thought that was why you had to consult the large set of ancestors in the snapshot root to see if you're still locked out/when you can start voting, or have we ditched that design?
Yeah that sounds right. IIRC we decided to punt those changes to a later PR, being as it's a corner case of a corner case.
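The lockout arithmetic behind the example above can be sketched in isolation (helper names are hypothetical; this just assumes a lockout of 2^confirmation_count slots): a vote for slot 2 at confirmation count 1 forbids voting on the other fork until slot 4.

```rust
// Assumed lockout rule: a vote is locked out for 2^confirmation_count slots.
fn lockout_expiration(vote_slot: u64, confirmation_count: u32) -> u64 {
    vote_slot + 2u64.pow(confirmation_count)
}

// A candidate slot on a different fork is off-limits while the lockout holds.
fn is_locked_out_at(vote_slot: u64, confirmation_count: u32, candidate_slot: u64) -> bool {
    candidate_slot < lockout_expiration(vote_slot, confirmation_count)
}

fn main() {
    // Voted for slot 2; the snapshot root 3 is on the other fork.
    assert!(is_locked_out_at(2, 1, 3)); // still locked out at slot 3
    assert!(!is_locked_out_at(2, 1, 4)); // free to vote from slot 4 onward
    println!(
        "lockout from the vote on slot 2 expires at slot {}",
        lockout_expiration(2, 1)
    );
}
```

Answering "am I still locked out past the snapshot root?" requires exactly the ancestor information the thread is discussing, which is why punting it leaves a gap.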
Crud, rebased and am now hitting an .unwrap() added by #9218
```rust
let my_root_slot = self.lockouts.root_slot.unwrap_or(0);
if root_slot > my_root_slot {
    self.lockouts.root_slot = Some(root_slot);
    self.lockouts.votes.retain(|v| v.slot > root_slot);
```
@t-nelson I think this part is why we need to consult the ancestors: what if the votes higher than the root_slot don't descend from root_slot here? It may not be that much of an edge case. Say the fork structure looks like:

```
  0
 / \
2   3
```

I vote for slot 3, crash, and promptly boot back up. The rest of the cluster has rooted slot 2 and I get a snapshot for slot 2. Now my tower is 3, 2, which is inconsistent, as it's assumed the tower is all one fork (see giant comment below for a proposal to handle this).
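The inconsistency can be reproduced with a minimal standalone sketch of the retain-based pruning quoted above (simplified types, not the actual Tower): pruning by slot number alone keeps the vote for 3 even though 3 is not a descendant of the new root 2.

```rust
#[derive(Debug, PartialEq)]
struct Vote {
    slot: u64,
}

// Simplified stand-in for the tower state under discussion (hypothetical).
struct Tower {
    root_slot: Option<u64>,
    votes: Vec<Vote>,
}

impl Tower {
    fn adjust_lockouts_if_newer_root(&mut self, root_slot: u64) {
        let my_root_slot = self.root_slot.unwrap_or(0);
        if root_slot > my_root_slot {
            self.root_slot = Some(root_slot);
            // Prunes only by slot number; nothing checks that the surviving
            // votes actually descend from the new root.
            self.votes.retain(|v| v.slot > root_slot);
        }
    }
}

fn main() {
    // Fork 0 -> {2, 3}: we voted for 3, then restarted from a snapshot
    // rooted at 2, the fork the rest of the cluster chose.
    let mut tower = Tower {
        root_slot: Some(0),
        votes: vec![Vote { slot: 3 }],
    };
    tower.adjust_lockouts_if_newer_root(2);
    // The vote for 3 survives even though 3 does not descend from root 2,
    // leaving a tower that silently mixes forks.
    assert_eq!(tower.root_slot, Some(2));
    assert_eq!(tower.votes, vec![Vote { slot: 3 }]);
    println!("root={:?} votes={:?}", tower.root_slot, tower.votes);
}
```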
@t-nelson @mvines doh! The situation described above can occur, and does need to be handled. @aeyakovenko feel free to double check me here, but I think this is how it should be handled. The situation looks something like: your saved tower has 0, 2, 4, but the snapshot you are starting from has root 3. Because the blockstore processor will only play descendants of 3 on startup, how does this validator rejoin the main fork at root 3?

When both conditions 1) and 2) are met, then the validator can finally vote on some descendant.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

This stale pull request has been automatically closed. Thank you for your contributions.
@mvines thanks for opening this! Finally, I think I can get my hands on this... |
```rust
}

// Given an untimely crash, tower may have roots that are not reflected in blockstore because
// `ReplayState::handle_votable_bank()` saves tower before setting blockstore roots
```
@t-nelson @carllin (Hi! I'm taking over this PR from @mvines to really merge it this time.)
A bit of a dumb question: why are we saving this into a plain old file instead of rocksdb?
I think we can just save this under rocksdb with a new column family and get atomicity for free with rocksdb's write batch: https://github.com/facebook/rocksdb/wiki/Column-Families#writebatch https://docs.rs/rocksdb/0.14.0/rocksdb/struct.Options.html#method.set_atomic_flush (we're using the WAL)
Also, "saves tower before setting blockstore roots" cannot be guaranteed because we currently aren't doing any fdatasync() or equivalent, considering a validator process crash or even an os-level crash.
And a naive fdatasync() of the tower bin file here would hurt performance. Also, this doesn't guarantee the write ordering between the plain old file and rocksdb: not just tower before blockstore, the opposite is possible as well.
Overall, I think it's a lot easier if we just rely on rocksdb. I'm glad to explain the reasoning further if I'm missing something. :)
@ryoqun storing in rocks should be fine, yeah. I think when @TristanDebrunner originally started this, there was an initiative to drop our RocksDB dependency, so we avoided leaning on it more.
@ryoqun I think it's a durability issue. Votes are sensitive enough (slashing condition!) where we want to make sure the vote is persisted to disk before submitting the vote to the network. Even with a write batch + write ahead log, Rocks buffers a lot of the writes in memory buffers without guaranteeing they are immediately flushed.
Yeah I think we should probably be fsync'ing the file in tower.save
It would be good to measure the fsync performance, if it's really bad we can pipeline it with the rest of the replay logic so we don't halt on replaying further blocks. We just need to have a queue of votes that are pending commit to disk, and have not yet been submitted to the network.
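A minimal sketch of the fsync approach being discussed, assuming a plain-file tower save (the path and serialization bytes are placeholders, not the PR's actual format): write the serialized tower, then force it to stable storage before the corresponding vote is allowed to leave for the network.

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

// Hypothetical save helper: persist the tower bytes durably before the
// matching vote is submitted to the cluster.
fn save_tower(path: &Path, serialized_tower: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(serialized_tower)?;
    // sync_data() maps to fdatasync(2) on Linux: block until the file
    // contents have reached disk, not just the OS page cache.
    file.sync_data()?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("tower-example.bin");
    save_tower(&path, b"example tower bytes")?;
    println!("tower persisted to {:?}", path);
    Ok(())
}
```

For crash atomicity (not just durability) the usual pattern is to write to a temporary file, sync it, then rename over the old tower file; and per the comment above, the sync could be pipelined behind replay with a queue of votes that are committed to disk but not yet submitted.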
Reboot of #7436
TODO:
context: #6936