Persistent tower by ryoqun · Pull Request #10718 · solana-labs/solana

ryoqun · 2020-06-19T11:39:58Z

Problem

tower isn't persisted.

Summary of Changes

Persist it

Todo

Address any remaining review comments from Save tower state persistently #7436 Save/restore Tower #9902

Fixes #6936

codecov · 2020-06-19T18:14:04Z

Codecov Report

Merging #10718 into master will increase coverage by 0.1%.
The diff coverage is 91.7%.

@@            Coverage Diff            @@
##           master   #10718     +/-   ##
=========================================
+ Coverage    82.1%    82.2%   +0.1%     
=========================================
  Files         341      341             
  Lines       80245    81022    +777     
=========================================
+ Hits        65896    66633    +737     
- Misses      14349    14389     +40

ryoqun · 2020-09-16T19:31:42Z

@carllin ~~I have yet to address #10718 (comment)~~ (EDIT: done) and https://github.com/solana-labs/solana/pull/10718/files#diff-a6e7be832768ca536ac5b7450a42ee09R1479. But, I've addressed all review comments regarding new strict handling around root of Tower. I hope this commit explain things enough for your concerns: 0bfb822

EDIT: I've also managed to simplify the adjust logic a bit: dd25003 df55bfb

ryoqun · 2020-09-17T05:30:32Z

I think this should still be safe with the changes here, because it should be equivalent to starting from a snapshot for slot 6 from the network, but I wanted to confirm with you that there's no edge cases.

Yeah, I think this is safe too. I lightly read #11727.

ryoqun · 2020-09-17T09:29:39Z


+#[test]
+#[serial]
+#[ignore]


Towards addressing this review comment: #10718 (comment)

@carllin Well, I found it's very hard to make a validator violate optimistic confirmation if and only if tower is removed.
The hard part is that marking a slot as dead makes a validator to violate the optimistic conf. even if tower is persited. And eventually, the cluster makes new roots.

The initial situation is like this:

A: validator A (observer) B: validator B (violator, will be restarted) A & B: 0 -> 1 -> 2 -> ... -> 50 -> 51 (last vote for A and B; thus opt. confirmed)

Then mark 51 as dead only on B and B's new view after restart:

[...] = not in bank_forks B: 0 -> 1 -> 2 -> ... -> 50 [-> 51 (stray last vote for B)]

Even if we keep (optimistically-confirmed) slot 51 in towrer, B's bank_forks doesn't have the ancestors info for slot 51 due to forcibly marking 51 as dead. So, this falls into this clause: let last_vote_ancestors = ... { &empty_ancestors };. Then, !last_vote_ancestors.contains(lockout_interval_start) always evaluates to true. So we can't correctly exclude A's stake using A's lockout_interval, which is actually ancestor of slot 51. Finally, B then starts to vote, creating a new fork like this:

B: 0 -> 1 -> 2 -> ... -> 49 -> 50 [-> 51 (stray last vote for B)] | +-> 56 -> 57

Ultimately, A starts to vote on the new bad fork, resulting in violated optimistic confirmation for the slot 51.

The forced dead slot 51 is the problem. But this shouldn't occur in practice. Usually, we should have ancestors for any voted slots in bank_forks. The only exception is newer snapshot. Snapshot is by-definition a new trustworthy root, so we can safely ignore unavailable ancestors for any stray last vote. Thus, this isn't bug per se.

But I couldn't come up with a good test scenario/strategy. Any suggestions?

@ryoqun, ah I see! Does that mean we can remove this: https://github.com/solana-labs/solana/pull/10718/files#diff-a6e7be832768ca536ac5b7450a42ee09R1439-R1444 as well now?

@ryoqun Hmm I couldn't come up with a simple example, but here's an interesting alternative idea:

You have validator A + B with 31% and 36% of the stake:

0 -> 1 -> 2 -> 3 (A + B vote, optimistically confirmed)

Turn off A + B, and truncate the ledger after slot 3 (simulate votes not landing in next slot). Start validator C with 33% of the stake with same ledger, but only up to slot 2. Have C generate some blocks like:

0 -> 1-> 2 -> 4

Then restart A which had 31% of the stake. With the tower, from A's perspective it sees:

0 -> 1 -> 2 -> 3 (voted) | -> 4 -> 5 (C's vote for 4)

The fork choice rule weights look like:

0 -> 1 -> 2 (ABC) -> 3 | -> 4 (C) -> 5

Without the tower, A would choose to vote on the fork with 4->5. This is true even if A generates a new fork starting at slot 3 because C has more stake than A so A will eventually pick the fork C is on. Furthermore B's vote on 3 is not observable because there are no descendants of slot 3, so that fork will not be chose over C's fork

With the persisted tower however, A should not be able to generate a switching proof.

oh this looks interesting case. Because this isn't critical part of this pr, I'll address this as a follow-up pr.

First let's land this 1700-lines pr to free me of rebasing... ;)

yeah totally agree 😄

carllin · 2020-09-19T01:58:49Z

@ryoqun I think we should wait to check this in until after the mainnet upgrade so that it can bake on Tds for a while

ryoqun · 2020-09-19T02:11:48Z

@ryoqun I think we should wait to check this in until after the mainnet upgrade so that it can bake on Tds for a while

@carllin Whoa. This pr finally got LGTM-ed? Yay!!!

So, let's target v1.4? So, merging is ok? and backporting (to v1.3) is not ok?. I'm getting tired of re-re-re-re-...-rebases, maybe another one due to this #12340. ;)

carllin · 2020-09-19T02:13:16Z

@ryoqun yeah lets target 1.4!

mvines · 2020-09-19T02:39:34Z

So, let's target v1.4? So, merging is ok? and backporting (to v1.3) is not ok?. I'm getting tired of re-re-re-re-...-rebases, maybe another one due to this #12340. ;)

Sorry! Want me to wait until you merge this PR?

ryoqun · 2020-09-19T03:20:12Z

So, let's target v1.4? So, merging is ok? and backporting (to v1.3) is not ok?. I'm getting tired of re-re-re-re-...-rebases, maybe another one due to this #12340. ;)

Sorry! Want me to wait until you merge this PR?

@mvines if it's not too painful and/or time-pressured for you to rebase, please wait. I'll merge this pr in 30-min or so after final last-minute checking.

ryoqun · 2020-10-16T03:17:41Z

hmm. this isn't super fun; but it looks like even un-fsynced tower save takes around 7ms (medium):

And this directly delays that amount for any level of confirmations.

CC: @carllin

ryoqun · 2020-10-16T03:24:25Z

come to think of it, I remembered we sign the saved tower.

ryoqun · 2020-10-23T23:40:23Z

Weekly update on the perf and error monitoring!

it seems that mean of tower save time is still high. I'd like this to be well below 1ms..

After the first upgrade with tower.

Also, it seems that there is no tower_error/tower_warn/panic metric indicating some bug in persistent tower (checked with chronograf). (Well, #12671 isn't landed, though). Already 200+ validators experienced the upgrade:

$ ./target/release/solana --url http://testnet.solana.com validators
Active Stake: 16705542.793110987 SOL
Current Stake: 16493998.208943831 SOL (98.73%)
Delinquent Stake: 211544.584167159 SOL (1.27%)

Stake By Version:
1.3.11   -   0 current validators ( 0.00%),   1 delinquent validators ( 0.00%)
1.3.17   -   0 current validators ( 0.00%),   1 delinquent validators ( 0.00%)
1.4.1    - 111 current validators (26.41%),   6 delinquent validators ( 0.72%)
1.4.2    - 258 current validators (72.32%),   4 delinquent validators ( 0.36%)
unknown  -   0 current validators ( 0.00%),  69 delinquent validators ( 0.18%)

...

* Save/restore Tower * Avoid unwrap() * Rebase cleanups * Forcibly pass test * Correct reconcilation of votes after validator resume * d b g * Add more tests * fsync and fix test * Add test * Fix fmt * Debug * Fix tests... * save * Clarify error message and code cleaning around it * Move most of code out of tower save hot codepath * Proper comment for the lack of fsync on tower * Clean up * Clean up * Simpler type alias * Manage tower-restored ancestor slots without banks * Add comment * Extract long code blocks... * Add comment * Simplify returned tuple... * Tweak too aggresive log * Fix typo... * Add test * Update comment * Improve test to require non-empty stray restored slots * Measure tower save and dump all tower contents * Log adjust and add threshold related assertions * cleanup adjust * Properly lower stray restored slots priority... * Rust fmt * Fix test.... * Clarify comments a bit and add TowerError::TooNew * Further clean-up arround TowerError * Truly create ancestors by excluding last vote slot * Add comment for stray_restored_slots * Add comment for stray_restored_slots * Use BTreeSet * Consider root_slot into post-replay adjustment * Tweak logging * Add test for stray_restored_ancestors * Reorder some code * Better names for unit tests * Add frozen_abi to SavedTower * Fold long lines * Tweak stray ancestors and too old slot history * Re-adjust error conditon of too old slot history * Test normal ancestors is checked before stray ones * Fix conflict, update tests, adjust behavior a bit * Fix test * Address review comments * Last touch! * Immediately after creating cleaning pr * Revert stray slots * Revert comment... * Report error as metrics * Revert not to panic! and ignore unfixable test... * Normalize lockouts.root_slot more strictly * Add comments for panic! and more assertions * Proper initialize root without vote account * Clarify code and comments based on review feedback * Fix rebase * Further simplify based on assured tower root * Reorder code for more readability Co-authored-by: Michael Vines <mvines@gmail.com>

ryoqun marked this pull request as draft June 19, 2020 11:40

ryoqun mentioned this pull request Jun 19, 2020

Submit a vote timestamp every vote #10630

Merged

This was referenced Jun 22, 2020

Remove unused StakeLockout::lockout #10719

Merged

Remove unused ComputedBankState::lockout_intervals #10733

Closed

Add {Vote, Tower}::last_voted_slot() #10734

Merged

Further expand last_voted_slot terminology #10747

Merged

ryoqun force-pushed the persistent-tower branch 2 times, most recently from 2a50708 to e53a9cb Compare June 23, 2020 13:04

ryoqun commented Jun 24, 2020

View reviewed changes

Comment thread core/src/consensus.rs

ryoqun commented Jun 24, 2020

View reviewed changes

Comment thread sdk/src/slot_history.rs Outdated

ryoqun commented Jun 24, 2020

View reviewed changes

Comment thread core/src/consensus.rs Outdated

ryoqun commented Jun 24, 2020

View reviewed changes

Comment thread local-cluster/tests/local_cluster.rs

This was referenced Jun 25, 2020

Fix race in ci/run-sanity.sh #10796

Merged

Expose SlotHistory::oldest() -> Slot #10799

Merged

Tiny cleanups in preparation for persistent tower #10805

Merged

ryoqun commented Jun 25, 2020

View reviewed changes

Comment thread core/src/consensus.rs Outdated

ryoqun commented Jun 25, 2020

View reviewed changes

Comment thread core/src/consensus.rs Outdated

ryoqun force-pushed the persistent-tower branch from 36494de to 5fb5b8e Compare June 25, 2020 15:25

ryoqun commented Jun 25, 2020

View reviewed changes

Comment thread core/src/validator.rs Outdated

ryoqun commented Jun 25, 2020

View reviewed changes

Comment thread core/src/heaviest_subtree_fork_choice.rs Outdated

ryoqun force-pushed the persistent-tower branch from 5fb5b8e to ac24408 Compare June 26, 2020 05:34

ryoqun commented Jun 26, 2020

View reviewed changes

Comment thread core/src/consensus.rs Outdated

This was referenced Jun 29, 2020

TdS died 1.2.6 #10821

Closed

Consistently create temp dirs under ledger/farf #10848

Merged

ryoqun commented Jun 30, 2020

View reviewed changes

Comment thread runtime/src/accounts_index.rs

ryoqun force-pushed the persistent-tower branch from 71c6ba3 to dd1cfb8 Compare July 1, 2020 03:35

ryoqun commented Jul 1, 2020

View reviewed changes

Comment thread core/src/consensus.rs Outdated

ryoqun commented Jul 1, 2020

View reviewed changes

Comment thread core/src/validator.rs Outdated

ryoqun mentioned this pull request Jul 1, 2020

Rename new_banks_from_blockstore => new_banks_from_ledger #10864

Merged

ryoqun added 2 commits September 17, 2020 03:58

Clarify code and comments based on review feedback

0bfb822

Fix rebase

49c9e77

ryoqun force-pushed the persistent-tower branch from 9969102 to 49c9e77 Compare September 16, 2020 19:04

ryoqun added 2 commits September 17, 2020 13:25

Further simplify based on assured tower root

dd25003

Reorder code for more readability

df55bfb

ryoqun commented Sep 17, 2020

View reviewed changes

ryoqun requested a review from carllin September 18, 2020 04:31

carllin approved these changes Sep 19, 2020

View reviewed changes

phongbuigk approved these changes Sep 19, 2020

View reviewed changes

ryoqun removed the v1.3 label Sep 19, 2020

ryoqun merged commit cb8661b into solana-labs:master Sep 19, 2020

ryoqun mentioned this pull request Sep 19, 2020

Follow up to persistent tower with tests and API cleaning #12350

Merged

This was referenced Oct 12, 2020

Fix tower/blockstore unsync due to external causes #12671

Merged

Better tower logs for SwitchForkDecision and etc #12875

Merged

ryoqun mentioned this pull request Oct 22, 2020

Clean up opt conf verifier and vote state tracker #13081

Merged

This was referenced Oct 24, 2020

Allow existence of vote on root in saved tower #13135

Merged

Fix test_optimistic_confirmation_violation_without_tower() #13043

Merged

CriesofCarrots mentioned this pull request Dec 4, 2020

v1.3 persistent tower #13961

Closed

Conversation

ryoqun commented Jun 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Summary of Changes

Todo

Uh oh!

codecov Bot commented Jun 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryoqun commented Sep 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ryoqun commented Sep 17, 2020

Uh oh!

ryoqun Sep 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

carllin Sep 19, 2020

Choose a reason for hiding this comment

Uh oh!

carllin Sep 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ryoqun Sep 19, 2020

Choose a reason for hiding this comment

Uh oh!

carllin Sep 19, 2020

Choose a reason for hiding this comment

Uh oh!

carllin commented Sep 19, 2020

Uh oh!

ryoqun commented Sep 19, 2020

Uh oh!

carllin commented Sep 19, 2020

Uh oh!

mvines commented Sep 19, 2020

Uh oh!

ryoqun commented Sep 19, 2020

Uh oh!

ryoqun commented Oct 16, 2020

Uh oh!

ryoqun commented Oct 16, 2020

Uh oh!

ryoqun commented Oct 23, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ryoqun commented Jun 19, 2020 •

edited

Loading

codecov Bot commented Jun 19, 2020 •

edited

Loading

ryoqun commented Sep 16, 2020 •

edited

Loading

ryoqun Sep 17, 2020 •

edited

Loading

carllin Sep 19, 2020 •

edited

Loading