Save/restore Tower #9902
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##           master    #9902    +/-  ##
========================================
  Coverage    81.5%    81.5%
========================================
  Files         288      280      -8
  Lines       66475    65919    -556
========================================
- Hits        54198    53775    -423
+ Misses      12277    12144    -133
```
I think this is ready for review
t-nelson
left a comment
LGTM! Thanks for dragging this one over the line 🙏
```rust
error!("Tower restore failed: {:?}", err);
process::exit(1);
}
info!("Rebuilding tower from the latest vote account");
```
@t-nelson, do you remember what the original design was for the case where tower restoration failed? The danger with restoring from the latest vote account is that it could be missing some votes that have been submitted onto the network, but have not landed in any bank.
I think we said we should either signal to the user to make a new vote account and eat the warmup/cooldown for staking, or have them accept the risk of potentially submitting conflicting votes.
Ultimately I think it comes down to what the slashing design is (which we don't have finalized, so it makes it hard to reason about this). For instance, if we decide slashing can only occur for votes that land in a bank, that all votes before the root are not slashable (not sure if this is a reasonable assumption), and you have a reasonable guess for the range of your last vote, then maybe you can just wait for a root to be made that is sufficiently far past your estimated last vote... just food for thought
Hmm... Yeah, there should definitely be user intervention here (at least by the time slashing is enabled). It is behind a CLI flag; maybe it'd be sufficient to reverse its default behavior? That is, require the tower by default.
agreed, there should be some level of intervention/notification here
```rust
    self.lockouts.root_slot
}

pub fn last_lockout_vote_slot(&self) -> Option<Slot> {
```
will this always just return the last item in self.lockouts.votes since the votes are always ordered smallest to largest?
Yeah but the vote ordering doesn't matter since this implementation does a max_by()
oh I meant it seems like the implementation can be simplified to self.lockouts.votes.first().map(|v| v.slot)
Ah, right. I was scared to make that sneaky assumption. Feels like it could silently break on me in the future. Should I just find some courage? 🦁
hehe, true, but we could write a test that asserts this finds the largest!
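A test like the one suggested could pin the invariant down. Here is a minimal standalone sketch (the `Lockout` type and function names are simplified stand-ins, not the real `solana_vote_program` ones, and it assumes lockouts double with confirmation count): because older votes carry deeper confirmation counts, the vote with the largest lockout expiration is always the first entry, so `max_by_key` and `first()` must agree.

```rust
// Simplified stand-in for the real vote lockout type (hypothetical).
struct Lockout {
    slot: u64,
    confirmation_count: u32,
}

impl Lockout {
    // Assumes lockout = 2^confirmation_count, so the lockout on this vote
    // expires at slot + 2^confirmation_count.
    fn expiration_slot(&self) -> u64 {
        self.slot + 2u64.pow(self.confirmation_count)
    }
}

// The current implementation: scan for the vote with the largest lockout.
fn last_lockout_vote_slot_max(votes: &[Lockout]) -> Option<u64> {
    votes.iter().max_by_key(|v| v.expiration_slot()).map(|v| v.slot)
}

// The proposed simplification: rely on the ordering invariant and take the
// first entry (oldest vote, deepest confirmation count, largest lockout).
fn last_lockout_vote_slot_first(votes: &[Lockout]) -> Option<u64> {
    votes.first().map(|v| v.slot)
}

fn main() {
    let votes = vec![
        Lockout { slot: 1, confirmation_count: 3 }, // expires at 1 + 8 = 9
        Lockout { slot: 3, confirmation_count: 2 }, // expires at 3 + 4 = 7
        Lockout { slot: 4, confirmation_count: 1 }, // expires at 4 + 2 = 6
    ];
    // Both implementations agree while the invariant holds.
    assert_eq!(last_lockout_vote_slot_max(&votes), Some(1));
    assert_eq!(
        last_lockout_vote_slot_max(&votes),
        last_lockout_vote_slot_first(&votes)
    );
    println!("{:?}", last_lockout_vote_slot_first(&votes)); // Some(1)
}
```

A unit test asserting the two implementations agree on towers that satisfy the ordering invariant would make the "sneaky assumption" explicit and fail loudly if it ever breaks.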
```rust
pub fn adjust_lockouts_if_newer_root(&mut self, root_slot: Slot) {
    let my_root_slot = self.lockouts.root_slot.unwrap_or(0);
    if root_slot > my_root_slot {
```
@t-nelson what happens in this case if there are votes in the tower that are still locked out past the snapshot root?
Aka you have a fork structure that looks like:

```
  0
 / \
2   3
```

Most recently you voted for 2 in your tower, but your snapshot root is 3, so you shouldn't vote until at least slot 4.
I thought that was why you had to consult the large set of ancestors in the snapshot root to see if you're still locked out/when you can start voting, or have we ditched that design?
Yeah that sounds right. IIRC we decided to punt those changes to a later PR, being as it's a corner case of a corner case.
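The lockout arithmetic behind the example above can be sketched in isolation (helper names are hypothetical; this just assumes a lockout of 2^confirmation_count slots): a vote for slot 2 at confirmation count 1 forbids voting on the other fork until slot 4.

```rust
// Assumed lockout rule: a vote is locked out for 2^confirmation_count slots.
fn lockout_expiration(vote_slot: u64, confirmation_count: u32) -> u64 {
    vote_slot + 2u64.pow(confirmation_count)
}

// A candidate slot on a different fork is off-limits while the lockout holds.
fn is_locked_out_at(vote_slot: u64, confirmation_count: u32, candidate_slot: u64) -> bool {
    candidate_slot < lockout_expiration(vote_slot, confirmation_count)
}

fn main() {
    // Voted for slot 2; the snapshot root 3 is on the other fork.
    assert!(is_locked_out_at(2, 1, 3)); // still locked out at slot 3
    assert!(!is_locked_out_at(2, 1, 4)); // free to vote from slot 4 onward
    println!(
        "lockout from the vote on slot 2 expires at slot {}",
        lockout_expiration(2, 1)
    );
}
```

Answering "am I still locked out past the snapshot root?" requires exactly the ancestor information the thread is discussing, which is why punting it leaves a gap.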
Crud, rebased and am now hitting an .unwrap() added by #9218
```rust
let my_root_slot = self.lockouts.root_slot.unwrap_or(0);
if root_slot > my_root_slot {
    self.lockouts.root_slot = Some(root_slot);
    self.lockouts.votes.retain(|v| v.slot > root_slot);
```
@t-nelson I think this part is why we need to consult the ancestors: what if the votes higher than the root_slot don't descend from root_slot here? It may not be that much of an edge case. Say the fork structure looks like:

```
  0
 / \
2   3
```

I vote for slot 3, crash, and promptly boot back up. The rest of the cluster has rooted slot 2 and I get a snapshot for slot 2. Now my tower is 3, 2, which is inconsistent, as it's assumed the tower is all one fork (see giant comment below for a proposal to handle this).
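The inconsistency can be reproduced with a minimal standalone sketch of the retain-based pruning quoted above (simplified types, not the actual Tower): pruning by slot number alone keeps the vote for 3 even though 3 is not a descendant of the new root 2.

```rust
#[derive(Debug, PartialEq)]
struct Vote {
    slot: u64,
}

// Simplified stand-in for the tower state under discussion (hypothetical).
struct Tower {
    root_slot: Option<u64>,
    votes: Vec<Vote>,
}

impl Tower {
    fn adjust_lockouts_if_newer_root(&mut self, root_slot: u64) {
        let my_root_slot = self.root_slot.unwrap_or(0);
        if root_slot > my_root_slot {
            self.root_slot = Some(root_slot);
            // Prunes only by slot number; nothing checks that the surviving
            // votes actually descend from the new root.
            self.votes.retain(|v| v.slot > root_slot);
        }
    }
}

fn main() {
    // Fork 0 -> {2, 3}: we voted for 3, then restarted from a snapshot
    // rooted at 2, the fork the rest of the cluster chose.
    let mut tower = Tower {
        root_slot: Some(0),
        votes: vec![Vote { slot: 3 }],
    };
    tower.adjust_lockouts_if_newer_root(2);
    // The vote for 3 survives even though 3 does not descend from root 2,
    // leaving a tower that silently mixes forks.
    assert_eq!(tower.root_slot, Some(2));
    assert_eq!(tower.votes, vec![Vote { slot: 3 }]);
    println!("root={:?} votes={:?}", tower.root_slot, tower.votes);
}
```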
@t-nelson @mvines doh! The situation described above can occur, and does need to be handled. @aeyakovenko feel free to double check me here, but I think this is how it should be handled. The situation looks something like: your saved tower has 0, 2, 4, but the snapshot you are starting from has root 3. Because the blockstore processor will only play descendants of 3 on startup, how does this validator rejoin the main fork at root 3?

When both conditions 1) and 2) are met, then the validator can finally vote on some descendant.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

This stale pull request has been automatically closed. Thank you for your contributions.
@mvines thanks for opening this! Finally, I think I can get my hands on this... |
```rust
}

// Given an untimely crash, tower may have roots that are not reflected in blockstore because
// `ReplayState::handle_votable_bank()` saves tower before setting blockstore roots
```
@t-nelson @carllin (Hi! I'm taking over this PR from @mvines to really merge it this time.)
A bit of a dumb question: why are we saving this into a plain old file instead of rocksdb?
I think we can just save this under rocksdb with a new column family and get atomicity for free with rocksdb's write batch: https://github.com/facebook/rocksdb/wiki/Column-Families#writebatch https://docs.rs/rocksdb/0.14.0/rocksdb/struct.Options.html#method.set_atomic_flush (we're using the WAL)
Also, "saves tower before setting blockstore roots" cannot be guaranteed because we currently aren't doing any fdatasync() or equivalent, considering a validator process crash or even an os-level crash.
And a naive fdatasync() of the tower bin file here would hurt performance. Also, this doesn't guarantee the write ordering between the plain old file and rocksdb: not just tower before blockstore, the opposite is possible as well.
Overall, I think it's a lot easier if we just rely on rocksdb. I'm glad to explain the reasoning further if I'm missing something. :)
@ryoqun storing in rocks should be fine, yeah. I think when @TristanDebrunner originally started this, there was an initiative to drop our RocksDB dependency, so we avoided leaning on it more.
@ryoqun I think it's a durability issue. Votes are sensitive enough (slashing condition!) where we want to make sure the vote is persisted to disk before submitting the vote to the network. Even with a write batch + write ahead log, Rocks buffers a lot of the writes in memory buffers without guaranteeing they are immediately flushed.
Yeah I think we should probably be fsync'ing the file in tower.save
It would be good to measure the fsync performance, if it's really bad we can pipeline it with the rest of the replay logic so we don't halt on replaying further blocks. We just need to have a queue of votes that are pending commit to disk, and have not yet been submitted to the network.
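A minimal sketch of the fsync approach being discussed, assuming a plain-file tower save (the path and serialization bytes are placeholders, not the PR's actual format): write the serialized tower, then force it to stable storage before the corresponding vote is allowed to leave for the network.

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

// Hypothetical save helper: persist the tower bytes durably before the
// matching vote is submitted to the cluster.
fn save_tower(path: &Path, serialized_tower: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(serialized_tower)?;
    // sync_data() maps to fdatasync(2) on Linux: block until the file
    // contents have reached disk, not just the OS page cache.
    file.sync_data()?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("tower-example.bin");
    save_tower(&path, b"example tower bytes")?;
    println!("tower persisted to {:?}", path);
    Ok(())
}
```

For crash atomicity (not just durability) the usual pattern is to write to a temporary file, sync it, then rename over the old tower file; and per the comment above, the sync could be pipelined behind replay with a queue of votes that are committed to disk but not yet submitted.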
Reboot of #7436
TODO:
context: #6936