Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Allow validators to reset to the slot which matches their last voted slot #28172

Merged
merged 9 commits into from
Oct 3, 2022

Conversation

jstarry
Copy link
Contributor

@jstarry jstarry commented Oct 1, 2022

Problem

Validators are unable to reset back to the heaviest bank on the fork that includes their last vote if the heaviest bank's slot matches their last voted slot. If there is a heavier bank on the last voted fork than the last voted bank itself, validators can erroneously reset to an invalid slot.

Summary of Changes

  • Allow resetting to a slot which matches the last vote slot
  • Disallow resetting to a slot which is part of an invalid fork
  • Fix tests

Fixes #

@jstarry jstarry changed the title Allow resetting to duplicate was confirmed Allow resetting to a duplicate slot that was confirmed by the cluster Oct 1, 2022
@jstarry
Copy link
Contributor Author

jstarry commented Oct 1, 2022

Turns out we already had a test for a very similar case: test_unconfirmed_duplicate_slots_and_lockouts. This PR still needs to modify or copy that test with a condition that makes the non voted fork the heavier fork and ensure that we can reset to the heaviest voted fork when the dup is confirmed but the switch threshold fails.

@jstarry jstarry marked this pull request as ready for review October 1, 2022 15:06
@jstarry
Copy link
Contributor Author

jstarry commented Oct 1, 2022

The latest commit has a test that fails on latest master but is fixed in this PR. This is now ready for review!

@jstarry
Copy link
Contributor Author

jstarry commented Oct 1, 2022

Just pushed a small patch to the failing test

);
// Should now pick 5 the heaviest fork from last vote again.
assert!(vote_fork.is_none());
assert_eq!(reset_fork.unwrap(), 5);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also would fail here if it didn't get stuck on the previous failure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice gud test I like

carllin
carllin previously approved these changes Oct 1, 2022
return None;
match self.is_candidate(&last_voted_slot_hash) {
Some(true) => self.best_slot(&last_voted_slot_hash),
Some(false) => None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a fine change, but trying to minimize as many logical changes as possible.

In this new code is_candidate(&last_voted_slot_hash) == false, this returns None. This seems like it might be a deviation from the original code.

To maintain the same behavior would require that in the new code if self.is_candidate(&last_voted_slot_hash) == Some(false) then that implies inthe old code that heaviest_slot_hash_on_same_voted_fork == None in the old code, which I don't think is necessarily true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time understanding what you're suggesting that I change here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To maintain the same behavior would require that in the new code if self.is_candidate(&last_voted_slot_hash) == Some(false) then that implies in the old code that heaviest_slot_hash_on_same_voted_fork == None in the old code, which I don't think is necessarily true?

I think he's saying that this is a new condition that wasn't checked in the old code that might now return None.
I don't think this implies in the old code that heaviest_slot_hash_on_same_voted_fork needed to be None, this just requires that there is an invalid ancestor which wasn't being checked before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed that we should try to reduce the change set to the bare minimum here. we need to fix the known bug now, further hardening can wait.

kinda confusing that a comment like this would result in an approved review. seems blocking

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is already at the bare minimum in my opinion. Take note that self.is_candidate(..).is_none() and self.best_slot(..).is_none() is the same thing because each of them is only None when the queried slot is not in fork_infos.

The only logical change is switching from filtering the returned heaviest slot by fork_info.is_candidate rather than checking if it is the same slot as the last vote. Checking is_candidate is crucial here because otherwise we would build blocks from heaviest bank even if it was a duplicate.

Copy link
Contributor Author

@jstarry jstarry Oct 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@t-nelson @carllin if you disagree on what the minimum change, please let me know. I'm also in favor of keeping the scope of this change as low as possible.

If we remove the last vote equality check and don't add the is_candidate check, then during the mainnet-beta issue, validators would have kept building off of 221 before it was determined to be confirmed. This might be a good thing actually but just wanted to call out that the last vote equality check happened to prevent this from happening.

The bug I intended to fix was that validators should've started building off of 221 after they detected it to be a confirmed duplicate, but didn't. Without is_candidate, the bug being fixed is that validators should've continued building off of 221 as soon as they started trying to switch to separate fork (225).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After more digging and discussion with @carllin, it looks like we have to add the is_candidate check after all because otherwise we could get into situations where the heaviest_slot_on_same_voted_fork was purged (because the cluster confirmed a different version) and we try to reset to a purged bank. I think that avoiding resetting to invalid forks should be avoided at all costs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that avoiding resetting to invalid forks should be avoided at all costs.

this would be separate from the condition that halted MB though, right? i'm not saying that we shouldn't harden further, just that we can run anything extra through the usual (priority) stabilization/backport process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did agree to add the is_candidate check in the normal process but without it we expose validators to increased risk of panicking due to purged slots.

AshwinSekar
AshwinSekar previously approved these changes Oct 1, 2022
return None;
match self.is_candidate(&last_voted_slot_hash) {
Some(true) => self.best_slot(&last_voted_slot_hash),
Some(false) => None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To maintain the same behavior would require that in the new code if self.is_candidate(&last_voted_slot_hash) == Some(false) then that implies in the old code that heaviest_slot_hash_on_same_voted_fork == None in the old code, which I don't think is necessarily true?

I think he's saying that this is a new condition that wasn't checked in the old code that might now return None.
I don't think this implies in the old code that heaviest_slot_hash_on_same_voted_fork needed to be None, this just requires that there is an invalid ancestor which wasn't being checked before.

@mergify mergify bot dismissed stale reviews from AshwinSekar and carllin October 1, 2022 20:41

Pull request has been modified.

@jstarry jstarry added the automerge Merge this Pull Request automatically once CI passes label Oct 1, 2022
return None;
match self.is_candidate(&last_voted_slot_hash) {
Some(true) => self.best_slot(&last_voted_slot_hash),
Some(false) => None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed that we should try to reduce the change set to the bare minimum here. we need to fix the known bug now, further hardening can wait.

kinda confusing that a comment like this would result in an approved review. seems blocking

assert!(vote_fork.is_none());
assert_eq!(reset_fork.unwrap(), 5);

// Mark 5 as duplicate, 4 should be the heaviest slot, but should not be votable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we assert the latter two conditions as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are implicitly asserted by checking that vote bank is none

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment? 🥺

@t-nelson t-nelson removed the automerge Merge this Pull Request automatically once CI passes label Oct 1, 2022
@jstarry jstarry changed the title Allow resetting to a duplicate slot that was confirmed by the cluster Allow resetting to a duplicate slot while waiting to switch to another fork Oct 2, 2022
@jstarry jstarry changed the title Allow resetting to a duplicate slot while waiting to switch to another fork Allow validators to reset to the slot which matches their last voted slot Oct 2, 2022
@jstarry jstarry added the automerge Merge this Pull Request automatically once CI passes label Oct 3, 2022
@mergify mergify bot removed the automerge Merge this Pull Request automatically once CI passes label Oct 3, 2022
@mergify
Copy link
Contributor

mergify bot commented Oct 3, 2022

automerge label removed due to a CI failure

@jstarry jstarry merged commit c2bb2b8 into solana-labs:master Oct 3, 2022
@jstarry jstarry deleted the fix/reset-to-dup branch October 3, 2022 08:50
mergify bot pushed a commit that referenced this pull request Oct 3, 2022
…slot (#28172)

* Add failing test

* Allow resetting to duplicate was confirmed

* feedback

* feedback

* bump

* simplify change

* Revert "simplify change"

This reverts commit 72e5de3.

* update comment

* Update core/src/replay_stage.rs

(cherry picked from commit c2bb2b8)
mergify bot pushed a commit that referenced this pull request Oct 3, 2022
…slot (#28172)

* Add failing test

* Allow resetting to duplicate was confirmed

* feedback

* feedback

* bump

* simplify change

* Revert "simplify change"

This reverts commit 72e5de3.

* update comment

* Update core/src/replay_stage.rs

(cherry picked from commit c2bb2b8)
mergify bot pushed a commit that referenced this pull request Oct 3, 2022
…slot (#28172)

* Add failing test

* Allow resetting to duplicate was confirmed

* feedback

* feedback

* bump

* simplify change

* Revert "simplify change"

This reverts commit 72e5de3.

* update comment

* Update core/src/replay_stage.rs

(cherry picked from commit c2bb2b8)
mergify bot added a commit that referenced this pull request Oct 3, 2022
…slot (backport #28172) (#28185)

Allow validators to reset to the slot which matches their last voted slot (#28172)

* Add failing test

* Allow resetting to duplicate was confirmed

* feedback

* feedback

* bump

* simplify change

* Revert "simplify change"

This reverts commit 72e5de3.

* update comment

* Update core/src/replay_stage.rs

(cherry picked from commit c2bb2b8)

Co-authored-by: Justin Starry <[email protected]>
mergify bot added a commit that referenced this pull request Oct 3, 2022
…slot (backport #28172) (#28186)

Allow validators to reset to the slot which matches their last voted slot (#28172)

* Add failing test

* Allow resetting to duplicate was confirmed

* feedback

* feedback

* bump

* simplify change

* Revert "simplify change"

This reverts commit 72e5de3.

* update comment

* Update core/src/replay_stage.rs

(cherry picked from commit c2bb2b8)

Co-authored-by: Justin Starry <[email protected]>
mergify bot added a commit that referenced this pull request Oct 3, 2022
…slot (backport #28172) (#28187)

Allow validators to reset to the slot which matches their last voted slot (#28172)

* Add failing test

* Allow resetting to duplicate was confirmed

* feedback

* feedback

* bump

* simplify change

* Revert "simplify change"

This reverts commit 72e5de3.

* update comment

* Update core/src/replay_stage.rs

(cherry picked from commit c2bb2b8)

Co-authored-by: Justin Starry <[email protected]>
Comment on lines +2908 to +2915
/// If `heaviest_bank_on_same_voted_fork` is `None` due to that fork no
/// longer being valid to vote on, it's possible that a validator will not
/// be able to reset away from the invalid fork that they last voted on. To
/// resolve this scenario, validators need to wait until they can create a
/// switch proof for another fork or until the invalid fork is be marked
/// valid again if it was confirmed by the cluster.
/// Until this is resolved, leaders will build each of their
/// blocks from the last reset bank on the invalid fork.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@@ -5605,6 +5614,139 @@ pub(crate) mod tests {
);
}

#[test]
fn test_unconfirmed_duplicate_slots_and_lockouts_for_non_heaviest_fork() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ textbook-level quite readable unit test.

Comment on lines +5653 to +5654
assert_eq!(vote_fork.unwrap(), 4);
assert_eq!(reset_fork.unwrap(), 4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can just resort to () for adhoc type:

assert_eq!((vote_fork, reset_fork), (Some(4), Some(4));

this should give most helpful assert.

...in other words, rust rocks

same for below this paired asserts.

// (newer snapshot, or only a saved tower is moved over to new setup?)
return None;
match self.is_candidate(&last_voted_slot_hash) {
Some(true) => self.best_slot(&last_voted_slot_hash),
Copy link
Contributor

@ryoqun ryoqun Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: mentioned at the different review thread, but i'd prefer provable .unwrap() here for less conditions: Some(self.best_slot(&last_voted_slot_hash).unwrap()).

Copy link
Contributor

@ryoqun ryoqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more (post-merge) lgtm from me, who did independent root cause analysis. :)

carllin added a commit to carllin/solana that referenced this pull request Oct 25, 2022
carllin added a commit to carllin/solana that referenced this pull request Dec 3, 2022
carllin added a commit to carllin/solana that referenced this pull request Dec 3, 2022
carllin added a commit to carllin/solana that referenced this pull request Dec 23, 2022
carllin added a commit to carllin/solana that referenced this pull request Dec 23, 2022
carllin added a commit to carllin/solana that referenced this pull request Dec 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants