Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

[Obsoleted] [wip] Retain N epochs worth of epoch_stakes#7098

Closed
ryoqun wants to merge 10 commits intosolana-labs:masterfrom
ryoqun:retain-n-epoch-stakes
Closed

[Obsoleted] [wip] Retain N epochs worth of epoch_stakes#7098
ryoqun wants to merge 10 commits intosolana-labs:masterfrom
ryoqun:retain-n-epoch-stakes

Conversation

@ryoqun
Copy link
Copy Markdown
Contributor

@ryoqun ryoqun commented Nov 22, 2019

Problem

From #6991:

the bank never culls epoch stakes

Solution

Again From #6991:

retain at most N epochs worth of epoch_stakes

Misc

  • I added @rob-solana as a reviewer guessing from issue creator can review this, nice to meet you!
  • Also, I tried to make added tests readable as much as possible. A bit too much, but I hope our test assets could be close to them by enriching test vocabulary like this...

Fixes #6991

@ryoqun ryoqun requested a review from rob-solana November 22, 2019 19:54
@ryoqun ryoqun self-assigned this Nov 22, 2019
Comment thread runtime/src/bank.rs Outdated
Comment thread runtime/src/bank.rs Outdated
Comment thread runtime/src/bank.rs
assert_eq!(bank1.block_height(), 1);
}

impl Bank {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments here would be awesome :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! Piece of cake once I fix the snapshot issue. :)

rob-solana
rob-solana previously approved these changes Nov 22, 2019
Copy link
Copy Markdown
Contributor

@rob-solana rob-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

you might peruse staking_utils and the blocktree shred insertion code for assumptions that leader_schedule_stakes for past epochs are always present...

@codecov
Copy link
Copy Markdown

codecov Bot commented Nov 22, 2019

Codecov Report

Merging #7098 into master will decrease coverage by 4.9%.
The diff coverage is 59.2%.

@@           Coverage Diff            @@
##           master   #7098     +/-   ##
========================================
- Coverage    79.1%   74.2%     -5%     
========================================
  Files         230     230             
  Lines       44333   47320   +2987     
========================================
+ Hits        35099   35137     +38     
- Misses       9234   12183   +2949

@ryoqun
Copy link
Copy Markdown
Contributor Author

ryoqun commented Nov 22, 2019

lgtm

hooray!

you might peruse staking_utils and the blocktree shred insertion code for assumptions that leader_schedule_stakes for past epochs are always present...

Thanks for a pointer for possible relevant subsystems! I'll check it before merging.

@mergify mergify Bot dismissed rob-solana’s stale review November 22, 2019 20:48

Pull request has been modified.

@ryoqun ryoqun force-pushed the retain-n-epoch-stakes branch from 83f489f to 358046f Compare November 25, 2019 08:08
// Calculate the schedule for all epochs between 0 and leader_schedule_epoch(root)
let leader_schedule_epoch = epoch_schedule.get_leader_schedule_epoch(root_bank.slot());
for epoch in 0..leader_schedule_epoch {
for epoch in (leader_schedule_epoch.max(MAX_LEADER_SCHEDULE_STAKES)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might come out prettier if the bank gave you the epochs for which it had leader_schedule_stakes...

Comment thread runtime/src/bank.rs

pub const SECONDS_PER_YEAR: f64 = (365.25 * 24.0 * 60.0 * 60.0);

pub const MAX_LEADER_SCHEDULE_STAKES: Epoch = 3;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is small for testing? I thought we were gonna use 16 or 32

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Oh, sorry for the lack of comment. I'm intentionally reducing this to cause CI fragile. :D

@rob-solana At my current understanding with a day of code reading/testing, the assumptions that leader_schedule_stakes for past epochs are always present doesn't cause hard errors for validators (no unwrap on the data derived from bank.epoch_stakes), but might silently stop operating or skip (old) legimate tasks (unwrap_or/unwrap_defaults). So, I'm chasing the possible problem due to new retaining behavior.

.max(MAX_LEADER_SCHEDULE_STAKES)
- MAX_LEADER_SCHEDULE_STAKES
{
panic!(
Copy link
Copy Markdown
Contributor

@sagar-solana sagar-solana Nov 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure its best to crash here.

What if someone sends our validator a Shred that's really old, it'll cause the validator to panic.
Actually that's a bad example(but hopefully it gets the point across), I think the window_service should restructure its checks such that it doesn't risk querying a leader schedule that far back.

Since the function returns an Option, why not just return None and log an error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the window_service should restructure its checks such that it doesn't risk querying a leader schedule that far back. ... Since the function returns an Option, why not just return None and log an error?

Yes, that will be what I'd write as the final form of this PR!

I'm doing panic! here just to know how much other code is affected by this PR's change. So I should comment as such. Sorry for being less explicit about the intention. And thanks for checking my PR!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! Thanks for adding the [wip] in the title :) Maybe a Draft PR in future experiments?
Also the panic won't catch all possibilities. Like the RPC one I described but it's perfectly fine for what you're trying to do right now. Also imo snapshot stuff > this.

Comment thread runtime/src/bank.rs
self.epoch_stakes
.remove(&(leader_schedule_epoch - MAX_LEADER_SCHEDULE_STAKES));
}
error!(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the log level should be lower than error.

Just to stand out for testing purpose as said above. Sorry for misleading you!

Copy link
Copy Markdown
Contributor

@sagar-solana sagar-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks fine but I'm a little worried about putting panics into get_slot_leader. If we add an RPC api to get leader schedule, it's quite easy to overlook and will make it trivial to crash the validator.

@ryoqun ryoqun changed the title Retain N epochs worth of epoch_stakes [wip] Retain N epochs worth of epoch_stakes Nov 26, 2019
@rob-solana
Copy link
Copy Markdown
Contributor

still in progress, @ryoqun ?

@stale
Copy link
Copy Markdown

stale Bot commented Dec 17, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale Bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 17, 2019
@ryoqun
Copy link
Copy Markdown
Contributor Author

ryoqun commented Dec 19, 2019

@rob-solana Sorry for late reply. Sadly, this is not progressing here because I'm concentrating on snapshots recently.

@stale stale Bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 19, 2019
@ryoqun
Copy link
Copy Markdown
Contributor Author

ryoqun commented Dec 19, 2019

This PR has to update recently updated GetLeaderSchedule API as well with this PR as the API started to return user-specified epoch's leader schedules. If we retain at most N epoches worth of data, the API will start to return None or worse just error out... https://github.com/solana-labs/solana/pull/7542/files#diff-3857de88b9c51a001730fa012a6951b2R721

@stale
Copy link
Copy Markdown

stale Bot commented Dec 26, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale Bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 26, 2019
@stale
Copy link
Copy Markdown

stale Bot commented Jan 2, 2020

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale Bot closed this Jan 2, 2020
@rob-solana
Copy link
Copy Markdown
Contributor

@mvines might be a source of the size of the bank on a live network

@rob-solana rob-solana reopened this Jan 2, 2020
@stale stale Bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 2, 2020
@mvines mvines added this to the v0.21.7 milestone Jan 3, 2020
@mvines
Copy link
Copy Markdown
Contributor

mvines commented Jan 3, 2020

🎯 parent.epoch_stakes.clone() in Bank::new_from_parent() eats up 30MB RAM on my test ledger

@mvines
Copy link
Copy Markdown
Contributor

mvines commented Jan 4, 2020

Obsoleted by #7668

@mvines mvines closed this Jan 4, 2020
@rob-solana
Copy link
Copy Markdown
Contributor

this PR stalled because RPC APIs might crash if an epoch stakes request was outside what's been retained

@mvines
Copy link
Copy Markdown
Contributor

mvines commented Jan 6, 2020

Oh do you recall which RPC API that was?

@ryoqun
Copy link
Copy Markdown
Contributor Author

ryoqun commented Jan 6, 2020

@mvines Thanks for taking over this! I think the affected RPC API is GetLeaderSchedule at least as long as I'm aware of. @rob-solana might be pointing to others.

@mvines
Copy link
Copy Markdown
Contributor

mvines commented Jan 6, 2020

Oh I see. I think it's fine for now if the GetLeaderSchedule RPC API fails for older epochs. The reason I added that support was for the solana show-block-production command, but I never had the expectation that one could run solana show-block-production for all epochs since 0 (and in fact this would fail for other reasons too, since with the solana-validator --limit-ledger-size argument we prune blocktree data from older epochs anyway)

@ryoqun ryoqun changed the title [wip] Retain N epochs worth of epoch_stakes Obsoleted][wip] Retain N epochs worth of epoch_stakes Jan 6, 2020
@ryoqun ryoqun changed the title Obsoleted][wip] Retain N epochs worth of epoch_stakes [Obsoleted] [wip] Retain N epochs worth of epoch_stakes Jan 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bank.epoch_stakes grows without bound

4 participants