[Obsoleted] [wip] Retain N epochs worth of epoch_stakes #7098
ryoqun wants to merge 10 commits into solana-labs:master
Conversation
    assert_eq!(bank1.block_height(), 1);
}

impl Bank {
A few more comments here would be awesome :)
Thanks for the feedback! Piece of cake once I fix the snapshot issue. :)
rob-solana left a comment:
lgtm
you might peruse staking_utils and the blocktree shred insertion code for assumptions that leader_schedule_stakes for past epochs are always present...
Codecov Report
@@            Coverage Diff            @@
##           master    #7098     +/-   ##
=========================================
- Coverage    79.1%    74.2%       -5%
=========================================
  Files         230      230
  Lines       44333    47320    +2987
=========================================
+ Hits        35099    35137      +38
- Misses       9234    12183    +2949
hooray!
Thanks for the pointer to possibly relevant subsystems! I'll check them before merging.
Pull request has been modified.
ryoqun force-pushed from 83f489f to 358046f
  // Calculate the schedule for all epochs between 0 and leader_schedule_epoch(root)
  let leader_schedule_epoch = epoch_schedule.get_leader_schedule_epoch(root_bank.slot());
- for epoch in 0..leader_schedule_epoch {
+ for epoch in (leader_schedule_epoch.max(MAX_LEADER_SCHEDULE_STAKES)
this might come out prettier if the bank gave you the epochs for which it had leader_schedule_stakes...
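Roughly, that suggestion looks like the following minimal sketch. The `staked_epochs()` accessor and the `HashMap`-backed fields are hypothetical stand-ins here, not the actual `Bank` API:

```rust
use std::collections::HashMap;

type Epoch = u64;
type Stakes = HashMap<String, u64>; // stand-in for the real per-epoch stakes type

struct Bank {
    epoch_stakes: HashMap<Epoch, Stakes>,
}

impl Bank {
    /// Hypothetical accessor: the epochs for which this bank still
    /// retains leader-schedule stakes, in ascending order.
    fn staked_epochs(&self) -> Vec<Epoch> {
        let mut epochs: Vec<Epoch> = self.epoch_stakes.keys().copied().collect();
        epochs.sort_unstable();
        epochs
    }
}
```

The caller could then write `for epoch in root_bank.staked_epochs() { ... }` instead of re-deriving the retention window with the max/subtraction arithmetic above.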
  pub const SECONDS_PER_YEAR: f64 = (365.25 * 24.0 * 60.0 * 60.0);

+ pub const MAX_LEADER_SCHEDULE_STAKES: Epoch = 3;
this is small for testing? I thought we were gonna use 16 or 32
Yes. Oh, sorry for the lack of a comment. I'm intentionally reducing this to make CI fragile. :D
@rob-solana At my current understanding, after a day of code reading/testing, the assumption that leader_schedule_stakes for past epochs are always present doesn't cause hard errors for validators (there is no unwrap on data derived from bank.epoch_stakes), but it might make them silently stop operating or skip (old) legitimate tasks (unwrap_or/unwrap_or_default). So I'm chasing possible problems caused by the new retention behavior.
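To illustrate the two failure modes (with contrived epochs and stake numbers, not the real call sites):

```rust
use std::collections::HashMap;

type Epoch = u64;

// A pruned epoch hit via `.get(...).unwrap()` would be a hard panic; via
// `.unwrap_or_default()` it is indistinguishable from an epoch with zero
// stake, so the work is silently skipped instead.
fn process_epoch(retained: &HashMap<Epoch, u64>, epoch: Epoch) {
    let total_stake = retained.get(&epoch).copied().unwrap_or_default();
    if total_stake == 0 {
        return; // an old-but-legitimate task quietly disappears here
    }
    println!("processing epoch {} with total stake {}", epoch, total_stake);
}

fn main() {
    let mut retained = HashMap::new();
    retained.insert(5u64, 1_000u64); // only epoch 5 survived pruning
    process_epoch(&retained, 5); // runs normally
    process_epoch(&retained, 2); // silently skipped, no error at all
}
```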
        .max(MAX_LEADER_SCHEDULE_STAKES)
        - MAX_LEADER_SCHEDULE_STAKES
    {
        panic!(
I'm not sure it's best to crash here.
What if someone sends our validator a Shred that's really old? It'll cause the validator to panic.
Actually, that's a bad example (but hopefully it gets the point across); I think the window_service should restructure its checks such that it doesn't risk querying a leader schedule that far back.
Since the function returns an Option, why not just return None and log an error?
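Something in this shape, say. This is only a sketch under assumed names: `oldest_retained_epoch` is a hypothetical field, and `eprintln!` stands in for the `error!` log macro:

```rust
type Slot = u64;
type Epoch = u64;

struct LeaderScheduleCache {
    oldest_retained_epoch: Epoch, // hypothetical retention bound
}

impl LeaderScheduleCache {
    /// Log and return None for pruned epochs instead of panicking, so
    /// callers like window_service (or a future RPC API) degrade gracefully.
    fn slot_leader_at(&self, slot: Slot, epoch: Epoch) -> Option<String> {
        if epoch < self.oldest_retained_epoch {
            eprintln!(
                "leader schedule for epoch {} was pruned (oldest retained: {})",
                epoch, self.oldest_retained_epoch
            );
            return None;
        }
        Some(format!("leader for slot {}", slot)) // stand-in for the real lookup
    }
}
```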
> I think the window_service should restructure its checks such that it doesn't risk querying a leader schedule that far back. ... Since the function returns an Option, why not just return None and log an error?

Yes, that's what I'd write as the final form of this PR!
I'm using panic! here just to see how much other code is affected by this PR's change, so I should comment as such. Sorry for not being explicit about the intention. And thanks for checking my PR!
Gotcha! Thanks for adding the [wip] to the title :) Maybe a draft PR for future experiments?
Also, the panic won't catch all the possibilities, like the RPC one I described, but it's perfectly fine for what you're trying to do right now. Also, imo, snapshot stuff > this.
    self.epoch_stakes
        .remove(&(leader_schedule_epoch - MAX_LEADER_SCHEDULE_STAKES));
}
error!(
Yes, the log level should be lower than error.
It's just made to stand out for testing purposes, as said above. Sorry for misleading you!
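For context, the retention behavior under discussion amounts to roughly this self-contained sketch, where `HashMap<Epoch, u64>` and `update_epoch_stakes` are simplified stand-ins for the real stakes data and update path:

```rust
use std::collections::HashMap;

type Epoch = u64;
const MAX_LEADER_SCHEDULE_STAKES: Epoch = 3;

struct Bank {
    epoch_stakes: HashMap<Epoch, u64>,
}

impl Bank {
    fn update_epoch_stakes(&mut self, leader_schedule_epoch: Epoch) {
        // Record stakes for the newly reachable leader-schedule epoch...
        self.epoch_stakes.entry(leader_schedule_epoch).or_insert(0);
        // ...then drop the epoch that just fell out of the retention
        // window, guarding against underflow for the earliest epochs.
        if leader_schedule_epoch >= MAX_LEADER_SCHEDULE_STAKES {
            self.epoch_stakes
                .remove(&(leader_schedule_epoch - MAX_LEADER_SCHEDULE_STAKES));
        }
    }
}
```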
sagar-solana left a comment:
Overall this looks fine, but I'm a little worried about putting panics into get_slot_leader. If we add an RPC API to get the leader schedule, it's quite easy to overlook and will make it trivial to crash the validator.
still in progress, @ryoqun?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
@rob-solana Sorry for the late reply. Sadly, this is not progressing because I've been concentrating on snapshots recently.
This PR also has to update the recently changed GetLeaderSchedule API, since that API started returning the leader schedule for a user-specified epoch. If we retain at most N epochs' worth of data, the API will start to return None or, worse, just error out... https://github.com/solana-labs/solana/pull/7542/files#diff-3857de88b9c51a001730fa012a6951b2R721
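On the RPC side, a defensive shape might look like this (a sketch only; the real handler in rpc.rs has a different signature and richer types):

```rust
use std::collections::HashMap;

type Epoch = u64;
type LeaderSchedule = Vec<String>; // stand-in for the real schedule type

/// Map an epoch outside the retained window to None (serialized as JSON
/// null) rather than surfacing an internal error to RPC clients.
fn get_leader_schedule(
    retained: &HashMap<Epoch, LeaderSchedule>,
    epoch: Epoch,
) -> Option<LeaderSchedule> {
    retained.get(&epoch).cloned()
}
```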
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale pull request has been automatically closed. Thank you for your contributions.
@mvines might be a source of the size of the bank on a live network
🎯
Obsoleted by #7668
This PR stalled because RPC APIs might crash if an epoch stakes request was outside what's been retained.
Oh, do you recall which RPC API that was?
@mvines Thanks for taking over this! I think the affected RPC API is GetLeaderSchedule.
Oh I see. I think it's fine for now if the
Problem
From #6991:
Solution
Again, from #6991:
Misc
Fixes #6991