-
Notifications
You must be signed in to change notification settings - Fork 4.6k
Allow validators to reset to the slot which matches their last voted slot #28172
Conversation
Turns out we already had a test for a very similar case: |
The latest commit has a test that fails on latest master but is fixed in this PR. This is now ready for review! |
f6be321
to
addc85f
Compare
Just pushed a small patch to the failing test |
); | ||
// Should now pick 5 the heaviest fork from last vote again. | ||
assert!(vote_fork.is_none()); | ||
assert_eq!(reset_fork.unwrap(), 5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also would fail here if it didn't get stuck on the previous failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice gud test I like
return None; | ||
match self.is_candidate(&last_voted_slot_hash) { | ||
Some(true) => self.best_slot(&last_voted_slot_hash), | ||
Some(false) => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a fine change, but trying to minimize as many logical changes as possible.
In this new code is_candidate(&last_voted_slot_hash) == false
, this returns None. This seems like it might be a deviation from the original code.
To maintain the same behavior would require that in the new code if self.is_candidate(&last_voted_slot_hash) == Some(false)
then that implies inthe old code that heaviest_slot_hash_on_same_voted_fork == None
in the old code, which I don't think is necessarily true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a hard time understanding what you're suggesting that I change here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To maintain the same behavior would require that in the new code if self.is_candidate(&last_voted_slot_hash) == Some(false) then that implies in the old code that heaviest_slot_hash_on_same_voted_fork == None in the old code, which I don't think is necessarily true?
I think he's saying that this is a new condition that wasn't checked in the old code that might now return None.
I don't think this implies in the old code that heaviest_slot_hash_on_same_voted_fork
needed to be None, this just requires that there is an invalid ancestor which wasn't being checked before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed that we should try to reduce the change set to the bare minimum here. we need to fix the known bug now, further hardening can wait.
kinda confusing that a comment like this would result in an approved review. seems blocking
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change is already at the bare minimum in my opinion. Take note that self.is_candidate(..).is_none()
and self.best_slot(..).is_none()
is the same thing because each of them is only None
when the queried slot is not in fork_infos
.
The only logical change is switching from filtering the returned heaviest slot by fork_info.is_candidate
rather than checking if it is the same slot as the last vote. Checking is_candidate
is crucial here because otherwise we would build blocks from heaviest bank even if it was a duplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@t-nelson @carllin if you disagree on what the minimum change, please let me know. I'm also in favor of keeping the scope of this change as low as possible.
If we remove the last vote equality check and don't add the is_candidate check, then during the mainnet-beta issue, validators would have kept building off of 221 before it was determined to be confirmed. This might be a good thing actually but just wanted to call out that the last vote equality check happened to prevent this from happening.
The bug I intended to fix was that validators should've started building off of 221 after they detected it to be a confirmed duplicate, but didn't. Without is_candidate
, the bug being fixed is that validators should've continued building off of 221 as soon as they started trying to switch to separate fork (225).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After more digging and discussion with @carllin, it looks like we have to add the is_candidate
check after all because otherwise we could get into situations where the heaviest_slot_on_same_voted_fork
was purged (because the cluster confirmed a different version) and we try to reset to a purged bank. I think that avoiding resetting to invalid forks should be avoided at all costs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that avoiding resetting to invalid forks should be avoided at all costs.
this would be separate from the condition that halted MB though, right? i'm not saying that we shouldn't harden further, just that we can run anything extra through the usual (priority) stabilization/backport process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did agree to add the is_candidate check in the normal process but without it we expose validators to increased risk of panicking due to purged slots.
return None; | ||
match self.is_candidate(&last_voted_slot_hash) { | ||
Some(true) => self.best_slot(&last_voted_slot_hash), | ||
Some(false) => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To maintain the same behavior would require that in the new code if self.is_candidate(&last_voted_slot_hash) == Some(false) then that implies in the old code that heaviest_slot_hash_on_same_voted_fork == None in the old code, which I don't think is necessarily true?
I think he's saying that this is a new condition that wasn't checked in the old code that might now return None.
I don't think this implies in the old code that heaviest_slot_hash_on_same_voted_fork
needed to be None, this just requires that there is an invalid ancestor which wasn't being checked before.
Pull request has been modified.
return None; | ||
match self.is_candidate(&last_voted_slot_hash) { | ||
Some(true) => self.best_slot(&last_voted_slot_hash), | ||
Some(false) => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed that we should try to reduce the change set to the bare minimum here. we need to fix the known bug now, further hardening can wait.
kinda confusing that a comment like this would result in an approved review. seems blocking
core/src/replay_stage.rs
Outdated
assert!(vote_fork.is_none()); | ||
assert_eq!(reset_fork.unwrap(), 5); | ||
|
||
// Mark 5 as duplicate, 4 should be the heaviest slot, but should not be votable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we assert the latter two conditions as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are implicitly asserted by checking that vote bank is none
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment? 🥺
eb34319
to
c175173
Compare
automerge label removed due to a CI failure |
…slot (backport #28172) (#28185) Allow validators to reset to the slot which matches their last voted slot (#28172) * Add failing test * Allow resetting to duplicate was confirmed * feedback * feedback * bump * simplify change * Revert "simplify change" This reverts commit 72e5de3. * update comment * Update core/src/replay_stage.rs (cherry picked from commit c2bb2b8) Co-authored-by: Justin Starry <[email protected]>
…slot (backport #28172) (#28186) Allow validators to reset to the slot which matches their last voted slot (#28172) * Add failing test * Allow resetting to duplicate was confirmed * feedback * feedback * bump * simplify change * Revert "simplify change" This reverts commit 72e5de3. * update comment * Update core/src/replay_stage.rs (cherry picked from commit c2bb2b8) Co-authored-by: Justin Starry <[email protected]>
…slot (backport #28172) (#28187) Allow validators to reset to the slot which matches their last voted slot (#28172) * Add failing test * Allow resetting to duplicate was confirmed * feedback * feedback * bump * simplify change * Revert "simplify change" This reverts commit 72e5de3. * update comment * Update core/src/replay_stage.rs (cherry picked from commit c2bb2b8) Co-authored-by: Justin Starry <[email protected]>
/// If `heaviest_bank_on_same_voted_fork` is `None` due to that fork no | ||
/// longer being valid to vote on, it's possible that a validator will not | ||
/// be able to reset away from the invalid fork that they last voted on. To | ||
/// resolve this scenario, validators need to wait until they can create a | ||
/// switch proof for another fork or until the invalid fork is be marked | ||
/// valid again if it was confirmed by the cluster. | ||
/// Until this is resolved, leaders will build each of their | ||
/// blocks from the last reset bank on the invalid fork. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
@@ -5605,6 +5614,139 @@ pub(crate) mod tests { | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_unconfirmed_duplicate_slots_and_lockouts_for_non_heaviest_fork() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ textbook-level quite readable unit test.
assert_eq!(vote_fork.unwrap(), 4); | ||
assert_eq!(reset_fork.unwrap(), 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can just resort to ()
for adhoc type:
assert_eq!((vote_fork, reset_fork), (Some(4), Some(4));
this should give most helpful assert.
...in other words, rust rocks
same for below this paired asserts.
// (newer snapshot, or only a saved tower is moved over to new setup?) | ||
return None; | ||
match self.is_candidate(&last_voted_slot_hash) { | ||
Some(true) => self.best_slot(&last_voted_slot_hash), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: mentioned at the different review thread, but i'd prefer provable .unwrap()
here for less conditions: Some(self.best_slot(&last_voted_slot_hash).unwrap())
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more (post-merge) lgtm from me, who did independent root cause analysis. :)
…t voted slot (backport solana-labs#28172) (solana-labs#28187)" This reverts commit 0a7006a.
…t voted slot (backport solana-labs#28172) (solana-labs#28187)" This reverts commit 0a7006a.
…heir last voted slot (backport solana-labs#28172) (solana-labs#28187)"" This reverts commit 2e25fbc.
…t voted slot (backport solana-labs#28172) (solana-labs#28187)" This reverts commit 0a7006a.
…heir last voted slot (backport solana-labs#28172) (solana-labs#28187)"" This reverts commit 2e25fbc.
…t voted slot (solana-labs#28172)" This reverts commit c2bb2b8.
Problem
Validators are unable to reset back to the heaviest bank on the fork that includes their last vote if the heaviest bank's slot matches their last voted slot. If there is a heavier bank on the last voted fork than the last voted bank itself, validators can erroneously reset to an invalid slot.
Summary of Changes
Fixes #