[Obsoleted] [wip] Retain N epochs worth of epoch_stakes by ryoqun · Pull Request #7098 · solana-labs/solana

ryoqun · 2019-11-22T19:54:24Z

Problem

From #6991:

the bank never culls epoch stakes

Solution

Again, from #6991:

retain at most N epochs worth of epoch_stakes
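As a rough illustration of the idea (not the code in this PR), pruning a per-epoch map down to the newest N entries could look like the sketch below. The `Epoch` alias, the generic value type `V`, and the helper name `prune_epoch_stakes` are assumptions made for this example; only the constant name `MAX_LEADER_SCHEDULE_STAKES` comes from the diff under review.

```rust
use std::collections::HashMap;

/// Stand-in for the runtime's epoch number type (assumed to be a u64 here).
type Epoch = u64;

/// Retention window; the thread discusses 3 (for testing), 16, or 32.
const MAX_LEADER_SCHEDULE_STAKES: Epoch = 3;

/// Hypothetical helper: keep only the newest N epochs, ending at `newest_epoch`.
/// `V` stands in for whatever per-epoch stakes value the bank stores.
fn prune_epoch_stakes<V>(epoch_stakes: &mut HashMap<Epoch, V>, newest_epoch: Epoch) {
    // An epoch survives if it lies within N epochs of the newest one.
    // saturating_add keeps the comparison safe near u64::MAX.
    epoch_stakes.retain(|epoch, _| epoch.saturating_add(MAX_LEADER_SCHEDULE_STAKES) > newest_epoch);
}
```

With N = 3 and a newest epoch of 5, only epochs 3, 4, and 5 remain. While fewer than N epochs exist, nothing is removed.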

Misc

I added @rob-solana as a reviewer, guessing that the issue's creator can review this. Nice to meet you!
Also, I tried to make the added tests as readable as possible. Maybe a bit too much, but I hope our test assets can get closer to this by enriching the test vocabulary like this...

Fixes #6991

rob-solana · 2019-11-22T20:17:01Z

        assert_eq!(bank1.block_height(), 1);
    }

+    impl Bank {


A few more comments here would be awesome :)

Thanks for the feedback! Piece of cake once I fix the snapshot issue. :)

rob-solana

lgtm

you might peruse staking_utils and the blocktree shred insertion code for assumptions that leader_schedule_stakes for past epochs are always present...

codecov · 2019-11-22T20:32:44Z

Codecov Report

Merging #7098 into master will decrease coverage by 4.9%.
The diff coverage is 59.2%.

@@           Coverage Diff            @@
##           master   #7098     +/-   ##
========================================
- Coverage    79.1%   74.2%     -5%     
========================================
  Files         230     230             
  Lines       44333   47320   +2987     
========================================
+ Hits        35099   35137     +38     
- Misses       9234   12183   +2949

ryoqun · 2019-11-22T20:46:27Z

lgtm

hooray!

you might peruse staking_utils and the blocktree shred insertion code for assumptions that leader_schedule_stakes for past epochs are always present...

Thanks for the pointer to possibly relevant subsystems! I'll check them before merging.

Pull request has been modified.

rob-solana · 2019-11-25T20:07:04Z

        // Calculate the schedule for all epochs between 0 and leader_schedule_epoch(root)
        let leader_schedule_epoch = epoch_schedule.get_leader_schedule_epoch(root_bank.slot());
-        for epoch in 0..leader_schedule_epoch {
+        for epoch in (leader_schedule_epoch.max(MAX_LEADER_SCHEDULE_STAKES)


this might come out prettier if the bank gave you the epochs for which it had leader_schedule_stakes...
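One way to read this suggestion, sketched under assumptions: instead of the caller computing the start of the range with `max`/subtraction arithmetic, the holder of the stakes exposes the epochs it still retains. `EpochStakesStore`, `retained_epochs`, and `epochs_to_schedule` are hypothetical names for this example, and the value type is simplified to a single number.

```rust
use std::collections::HashMap;

type Epoch = u64;

/// Hypothetical stand-in for the part of the bank that holds per-epoch stakes.
struct EpochStakesStore {
    /// Value simplified to a total stake for the sake of the example.
    epoch_stakes: HashMap<Epoch, u64>,
}

impl EpochStakesStore {
    /// Epochs for which stakes are still retained, in ascending order.
    fn retained_epochs(&self) -> Vec<Epoch> {
        let mut epochs: Vec<Epoch> = self.epoch_stakes.keys().copied().collect();
        epochs.sort_unstable();
        epochs
    }
}

/// Caller side: visit only retained epochs below `leader_schedule_epoch`,
/// mirroring the exclusive upper bound of the original `0..leader_schedule_epoch` loop.
fn epochs_to_schedule(store: &EpochStakesStore, leader_schedule_epoch: Epoch) -> Vec<Epoch> {
    store
        .retained_epochs()
        .into_iter()
        .filter(|epoch| *epoch < leader_schedule_epoch)
        .collect()
}
```

The caller then never needs to know the retention constant, so changing it cannot leave a stale bound behind in the schedule cache.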

rob-solana · 2019-11-25T20:07:32Z


 pub const SECONDS_PER_YEAR: f64 = (365.25 * 24.0 * 60.0 * 60.0);

+pub const MAX_LEADER_SCHEDULE_STAKES: Epoch = 3;


this is small for testing? I thought we were gonna use 16 or 32

Yes. Oh, sorry for the lack of a comment. I'm intentionally reducing this to make CI fragile. :D

@rob-solana My current understanding, after a day of code reading and testing: the assumption that leader_schedule_stakes for past epochs are always present doesn't cause hard errors for validators (there is no unwrap on data derived from bank.epoch_stakes), but validators might silently stop operating or skip (old) legitimate tasks (unwrap_or/unwrap_or_default). So I'm chasing possible problems caused by the new retention behavior.
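To make the concern concrete, here is a small sketch of how a defaulting lookup can hide pruned data. Both function names and the simplified `u64` stake value are assumptions for illustration; they are not taken from the codebase.

```rust
use std::collections::HashMap;

type Epoch = u64;

/// Silent variant: a pruned epoch is indistinguishable from an epoch with zero stake.
fn total_stake_or_zero(epoch_stakes: &HashMap<Epoch, u64>, epoch: Epoch) -> u64 {
    epoch_stakes.get(&epoch).copied().unwrap_or_default()
}

/// Explicit variant: the caller has to decide what a missing epoch means.
fn total_stake(epoch_stakes: &HashMap<Epoch, u64>, epoch: Epoch) -> Option<u64> {
    epoch_stakes.get(&epoch).copied()
}
```

Neither variant panics on a pruned epoch, but only the second lets the caller tell "no stake" apart from "no longer retained".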

sagar-solana · 2019-11-26T18:20:50Z

+                .max(MAX_LEADER_SCHEDULE_STAKES)
+                - MAX_LEADER_SCHEDULE_STAKES
+        {
+            panic!(


I'm not sure it's best to crash here.

What if someone sends our validator a really old Shred? It would cause the validator to panic.
Actually, that's a bad example (but hopefully it gets the point across); I think the window_service should restructure its checks such that it doesn't risk querying a leader schedule that far back.

Since the function returns an Option, why not just return None and log an error?
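A minimal sketch of that alternative, assuming a simplified lookup keyed by epoch: `leader_for_epoch` is a hypothetical function, the `String` values stand in for a real leader schedule, and `eprintln!` stands in for whichever logging macro and level the project would pick.

```rust
use std::collections::HashMap;

type Epoch = u64;

const MAX_LEADER_SCHEDULE_STAKES: Epoch = 3;

/// Hypothetical lookup that reports a too-old epoch instead of panicking.
fn leader_for_epoch(
    schedules: &HashMap<Epoch, String>,
    epoch: Epoch,
    newest_epoch: Epoch,
) -> Option<&str> {
    // Oldest epoch still inside the N-epoch window ending at `newest_epoch`.
    let oldest_retained = newest_epoch
        .saturating_add(1)
        .saturating_sub(MAX_LEADER_SCHEDULE_STAKES);
    if epoch < oldest_retained {
        // Stand-in for a real log call; the validator keeps running.
        eprintln!(
            "leader schedule requested for epoch {}, but only epochs >= {} are retained",
            epoch, oldest_retained
        );
        return None;
    }
    schedules.get(&epoch).map(String::as_str)
}
```

Because the return type is already an `Option`, callers that handle `None` today need no changes, and an old shred or an RPC request for a pruned epoch cannot take the process down.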

I think the window_service should restructure its checks such that it doesn't risk querying a leader schedule that far back. ... Since the function returns an Option, why not just return None and log an error?

Yes, that's what I'd write as the final form of this PR!

I'm using panic! here just to find out how much other code is affected by this PR's change, so I should have commented as such. Sorry for not being explicit about the intention. And thanks for checking my PR!

Gotcha! Thanks for adding the [wip] in the title :) Maybe a Draft PR for future experiments?
Also, the panic won't catch all possibilities, like the RPC one I described, but it's perfectly fine for what you're trying to do right now. Also imo snapshot stuff > this.

sagar-solana · 2019-11-26T18:25:36Z

+                self.epoch_stakes
+                    .remove(&(leader_schedule_epoch - MAX_LEADER_SCHEDULE_STAKES));
+            }
+            error!(


Yes, the log level should be lower than error.

It's just to stand out for testing purposes, as said above. Sorry for misleading you!

sagar-solana

Overall looks fine but I'm a little worried about putting panics into get_slot_leader. If we add an RPC api to get leader schedule, it's quite easy to overlook and will make it trivial to crash the validator.

rob-solana · 2019-12-10T22:55:54Z

still in progress, @ryoqun ?

stale · 2019-12-17T23:45:02Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

ryoqun · 2019-12-19T07:00:46Z

@rob-solana Sorry for the late reply. Sadly, this isn't progressing because I've been concentrating on snapshots recently.

ryoqun · 2019-12-19T07:02:38Z

This PR also has to update the recently changed GetLeaderSchedule API, since the API now returns leader schedules for a user-specified epoch. If we retain at most N epochs' worth of data, the API will start to return None or, worse, just error out... https://github.com/solana-labs/solana/pull/7542/files#diff-3857de88b9c51a001730fa012a6951b2R721

stale · 2019-12-26T07:34:54Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale · 2020-01-02T08:18:14Z

This stale pull request has been automatically closed. Thank you for your contributions.

rob-solana · 2020-01-02T21:16:29Z

@mvines this might be a source of the size of the bank on a live network

mvines · 2020-01-03T16:32:20Z

🎯 parent.epoch_stakes.clone() in Bank::new_from_parent() eats up 30MB RAM on my test ledger

mvines · 2020-01-04T06:39:59Z

Obsoleted by #7668

rob-solana · 2020-01-06T05:43:30Z

this PR stalled because RPC APIs might crash if an epoch stakes request was outside what's been retained

mvines · 2020-01-06T06:06:27Z

Oh do you recall which RPC API that was?

ryoqun · 2020-01-06T06:35:25Z

@mvines Thanks for taking this over! As far as I'm aware, the affected RPC API is at least GetLeaderSchedule. @rob-solana might be pointing to others.

mvines · 2020-01-06T06:47:46Z

Oh I see. I think it's fine for now if the GetLeaderSchedule RPC API fails for older epochs. The reason I added that support was for the solana show-block-production command, but I never had the expectation that one could run solana show-block-production for all epochs since 0 (and in fact this would fail for other reasons too, since with the solana-validator --limit-ledger-size argument we prune blocktree data from older epochs anyway)
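If failing for older epochs is acceptable, the failure can at least be made descriptive. The sketch below is an assumption-heavy illustration, not the actual RPC handler: `LeaderScheduleError`, `get_leader_schedule`, and the `Vec<String>` schedule representation are all hypothetical.

```rust
use std::collections::HashMap;
use std::fmt;

type Epoch = u64;

/// Hypothetical error for a request outside the retained epochs.
#[derive(Debug, PartialEq)]
enum LeaderScheduleError {
    EpochNotRetained { requested: Epoch, oldest_retained: Epoch },
}

impl fmt::Display for LeaderScheduleError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LeaderScheduleError::EpochNotRetained { requested, oldest_retained } => write!(
                f,
                "no leader schedule for epoch {}; oldest retained epoch is {}",
                requested, oldest_retained
            ),
        }
    }
}

/// Return the schedule for `requested`, or an error the RPC layer can relay to the client.
fn get_leader_schedule(
    retained: &HashMap<Epoch, Vec<String>>,
    requested: Epoch,
) -> Result<&[String], LeaderScheduleError> {
    retained.get(&requested).map(Vec::as_slice).ok_or_else(|| {
        // Report the oldest epoch still available so the caller can adjust the request.
        let oldest_retained = retained.keys().copied().min().unwrap_or(0);
        LeaderScheduleError::EpochNotRetained { requested, oldest_retained }
    })
}
```

A client such as a block-production report could then show the message rather than treating the failure as an unexplained error.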

ryoqun requested a review from rob-solana November 22, 2019 19:54

ryoqun self-assigned this Nov 22, 2019

rob-solana reviewed Nov 22, 2019


Comment thread runtime/src/bank.rs Outdated

rob-solana reviewed Nov 22, 2019


Comment thread runtime/src/bank.rs Outdated

rob-solana reviewed Nov 22, 2019


rob-solana previously approved these changes Nov 22, 2019


ryoqun added 6 commits November 24, 2019 18:43

Retain N epochs worth of epoch_stakes

91baaea

Fix rustc warning

7df99cb

Address review comments

aabc0a3

Forgot to update tests; Accidental constant change detection, proved ;)

eff19e6

Test on CI if anything should be broken

c60c1fd

rustfmt

358046f

ryoqun force-pushed the retain-n-epoch-stakes branch from 83f489f to 358046f Compare November 25, 2019 08:08

ryoqun added 3 commits November 25, 2019 17:24

fix overflow...

677818b

Also adjust test assertions...

25ecb99

Fix test?

2380be3

rob-solana requested a review from sagar-solana November 25, 2019 20:05

rob-solana reviewed Nov 25, 2019


Really fix test?

ba07af6

sagar-solana reviewed Nov 26, 2019


ryoqun changed the title ~~Retain N epochs worth of epoch_stakes~~ [wip] Retain N epochs worth of epoch_stakes Nov 26, 2019

stale Bot added the stale label Dec 17, 2019

stale Bot removed the stale label Dec 19, 2019

stale Bot added the stale label Dec 26, 2019

stale Bot closed this Jan 2, 2020

rob-solana reopened this Jan 2, 2020

stale Bot removed the stale label Jan 2, 2020

mvines added this to the v0.21.7 milestone Jan 3, 2020

mvines mentioned this pull request Jan 4, 2020

bank: Prune older epoch stakes #7668

Merged

mvines closed this Jan 4, 2020

ryoqun changed the title ~~[wip] Retain N epochs worth of epoch_stakes~~ Obsoleted][wip] Retain N epochs worth of epoch_stakes Jan 6, 2020

ryoqun changed the title ~~Obsoleted][wip] Retain N epochs worth of epoch_stakes~~ [Obsoleted] [wip] Retain N epochs worth of epoch_stakes Jan 6, 2020

