go: Fix runtime pruning #6355
Conversation
Force-pushed from b63f4ef to 26e3fd9.
This PR tries to fix the existing problem simply, by avoiding being "perfect". However, for an ideal solution we should first answer the question below, which also serves as complementary context to what we are trying to solve here: Who should be responsible for pruning of the runtime state?

Context

A seemingly trivial refactor clashes heavily with our current design/code organization. Since it is not the first time that we are clashing with these specific abstractions, I want this change to also be a step towards making our code more maintainable instead of complicating it further.

How things work currently

There are two databases used for maintaining runtime state: the light history and the State DB.
Who populates the state instances?

Light history is populated by the indexer. The State DB is populated by the storage committee worker by subscribing to incoming block headers. The incoming block headers are populated by the common committee worker that the storage committee worker subscribes to via ….

Who handles the lifetime of the state instances?

The lifetime of the light history and the corresponding indexer is currently owned by the runtime registry (see). The State DB, on the other hand, is created by the storage worker (see), yet closed by the runtime registry.

Who is responsible for the pruning?

Currently, pruning of the runtime state (both the light history and the State DB) …
Steps towards cleaner design

We should get there incrementally…
So who should be responsible for runtime state pruning?

Looking at how closely related the light history and the State DB are, especially as we have cases where you need both, I believe there should be a single process responsible for instantiating and closing both. Other processes can receive it as a parameter.

Storage committee worker

Observe that the storage committee worker is only started when you have local storage and configured runtimes (see). So the first question should be: do we have, or envision, a scenario where light history is being indexed but the storage worker should be disabled? Even if this is the case, we could work around it. Update: we have such a scenario: #6385 (e.g. a stateless client with a configured paratime).

I am inclined towards making the storage worker responsible for the light history and State DB lifetime, populating and pruning them, especially if we refactor the storage worker into many smaller ones. Note this would also mean moving the indexer into it. Finally, I believe this worker should export methods like …. Other workers that rely on these methods should define an interface and accept the storage committee worker as a parameter, therefore casting it to the subset of interface methods (see the sketch after this comment).

Runtime registry

This is an alternative to the above. Currently we don't have direct access to the …. Furthermore, I am not really convinced by the idea of the registry being responsible for pruning.

Conclusion

None of the solutions will be trivial to refactor to, so I would appreciate any input here on how to get there incrementally.
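As a rough illustration of the consumer-defined interface idea above, here is a minimal Go sketch. All names (`BlockHistory`, `WaitBlock`, `Consumer`) are hypothetical, not the actual oasis-core API:

```go
package example

import "context"

// BlockHistory is the narrow, consumer-defined interface: a worker
// declares only the subset of storage-worker methods it actually needs.
type BlockHistory interface {
	// WaitBlock blocks until the block at the given round is indexed.
	WaitBlock(ctx context.Context, round uint64) error
}

// Consumer is some other worker that depends only on the narrow
// interface, not on the whole storage committee worker.
type Consumer struct {
	history BlockHistory
}

// NewConsumer accepts anything satisfying BlockHistory, e.g. the storage
// committee worker, implicitly "casting" it to the subset of its methods.
func NewConsumer(history BlockHistory) *Consumer {
	return &Consumer{history: history}
}
```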
Tried to solve both problems described in #6352. Opened a simple solution that also suffers from some compromises, described commit by commit. For the clean solution we probably have to solve the problems described in the mini design doc first :/. Prior to making this PR ready for review, writing tests, and ensuring it works as expected (CI is green but I haven't tested), I would appreciate if we could first discuss the high-level direction.
Force-pushed from 26e3fd9 to eaf2c7c.
Force-pushed from eaf2c7c to bab14de.
Codecov Report

❌ Patch coverage is ….

@@ Coverage Diff @@
## master #6355 +/- ##
==========================================
+ Coverage 64.25% 64.67% +0.42%
==========================================
Files 698 698
Lines 68047 68070 +23
==========================================
+ Hits 43721 44022 +301
+ Misses 19321 19013 -308
- Partials 5005 5035 +30
Force-pushed from bab14de to 4b56de4.
Freshly rebased on top of #6384, which also briefly touched this issue.
peternose left a comment:
I think the first 5 commits look good, even if we change them later. So we could merge and continue in another PR.
//
// If an error is returned, pruning is aborted and the height is
// not pruned from history.
// CanPruneConsensus is called to check if the specified height can be pruned.
Why not just CanPrune? From the consensus point of view, adding Consensus is just redundant.
True, although it is easier to make sense of when implementing it inside go/runtime/history/history.go, and it is also "symmetric" with CanPruneRuntime, so either or...
My suggestion would be to also get rid of

// Note that this can be called for the same height multiple
// times (e.g., if one of the handlers fails but others succeed
// and pruning is later retried).

for both, given that func Can*() error implies this should be safe...
Force-pushed from 4b56de4 to 20bf6c0.
Not pruning because a prune handler does not allow it is not an error and should not be logged as such. Similar to #6161, which describes excessive usage of Error logging.
This rename is important to clearly communicate that the handler only checks whether pruning can happen. Possibly this method could return bool instead of error (a sketch follows below).
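A minimal sketch of such a check-only handler, assuming illustrative names rather than the exact oasis-core API; returning an error (instead of a bool) lets the handler also explain why pruning is currently blocked:

```go
package example

// PruneChecker is a hypothetical check-only prune handler. CanPrune must
// not perform any pruning itself; it only reports whether the given round
// may be pruned.
type PruneChecker interface {
	// CanPrune returns nil if the round can be pruned, or an error
	// describing why pruning is not (yet) allowed. A non-nil result is
	// expected during normal operation and must not be logged as an error.
	CanPrune(round uint64) error
}
```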
By moving the pruning of the state DB out of the prune handler, we solve the issue of nesting NodeDB.Prune (a possibly long operation) in the BadgerDB transaction created during the pruning of the light history. Furthermore, this also ensures that the light history is always pruned before the actual state. This is desired, as previously light history pruning had a side effect of deleting the state: if there was an error during pruning, the runtime history would get reverted, but the state wouldn't. Looking at the `history.History` `resolveRound`, this unlikely scenario was not anticipated, which could result in `GetAnnotatedBlock` breaking its contract. There is an additional change of semantics: previously, if your state had versions older than the earliest version in the light history, pruning would be stuck, as it would try to prune a non-earliest version from the NodeDB. Now this is fixed, but as a consequence this could silently start a very hot loop of pruning many versions. A sketch of the reordered flow follows below.
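A rough sketch of the reordered flow, with illustrative interfaces standing in for the real oasis-core types:

```go
package example

// History and NodeDB are illustrative stand-ins for the real types.
type History interface {
	// Prune removes the light block for the given round inside the
	// history's own (short-lived) BadgerDB transaction.
	Prune(round uint64) error
}

type NodeDB interface {
	EarliestVersion() uint64
	Prune(version uint64) error
}

// pruneRound prunes the light history first and only then the state
// versions, so the potentially long NodeDB.Prune no longer runs nested
// inside the history's transaction.
func pruneRound(history History, ndb NodeDB, round uint64) error {
	if err := history.Prune(round); err != nil {
		return err
	}
	// Prune all state versions up to and including the round. Note the
	// caveat above: if the state holds versions far older than the light
	// history, this loop can silently become very hot.
	for v := ndb.EarliestVersion(); v <= round; v++ {
		if err := ndb.Prune(v); err != nil {
			return err
		}
	}
	return nil
}
```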
This was done to be consistent with the consensus prune handlers' terminology. Notice that consensus accepts a single round whereas here we take a slice of rounds. In my opinion, pruning the light history in batches was a premature optimization, as the time to prune a light block << the time to prune a NodeDB version. For the follow-up we may want to unify this; possibly prune handlers could be consulted only for every n-th version instead of for every version (assuming that if pruning is valid for N, then it should also be valid for N-1). A sketch of this idea follows below.
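A small sketch of the every-n-th-version idea under the stated monotonicity assumption; the names are illustrative, not the actual implementation:

```go
package example

// maxPrunable consults the (hypothetical) canPrune check only at the
// largest multiple of n not above round; n must be > 0. Under the
// assumption that "prunable at N implies prunable at N-1", a successful
// check covers all rounds up to and including the returned value.
func maxPrunable(canPrune func(round uint64) error, round, n uint64) (uint64, error) {
	checkpoint := round - round%n
	if err := canPrune(checkpoint); err != nil {
		return 0, err
	}
	return checkpoint, nil
}
```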
Force-pushed from 20bf6c0 to 51407eb.
Agree, and kept only those.

Sanity checking the semantic change (thinking aloud): previously it could happen that state was pruned before the runtime light history (or was possibly missing, e.g. when doing runtime state sync from a checkpoint earlier than the last retained light header block - see). However, it never happened the other way around. This change makes this less likely (except for the late checkpoint scenario), as the light history is now always pruned before the State DB, changing the assumptions the old code may have relied on. Therefore we should double check that no existing code first queries the state / last retained version and then automatically assumes the light history has a corresponding light block for the queried height. I did not find any such place, nor would it be a sign of robust code had I found it. A double check would be appreciated; a defensive sketch of the pattern in question follows at the end of this comment.

Follow-up (won't be actively worked on)

Opened #6400, and also referenced it in the code. As stated in the issue, I believe the answer to "So who should be responsible for runtime state pruning?" is clear: the storage committee worker.
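To make the assumption concrete, a hedged sketch (hypothetical names, not the real API) of the pattern to double check for: code must not assume that a round still present in the state DB also has a light block in history, since history is now pruned first:

```go
package example

import "fmt"

// Block and the interfaces below are hypothetical stand-ins.
type Block struct{ Round uint64 }

type History interface {
	GetBlock(round uint64) (*Block, error)
}

type NodeDB interface {
	EarliestVersion() uint64
}

// earliestAnnotated looks up the light block for the earliest retained
// state version. The error from the history lookup must be handled: the
// light history may already have pruned this round even though state for
// it still exists.
func earliestAnnotated(history History, ndb NodeDB) (*Block, error) {
	round := ndb.EarliestVersion()
	blk, err := history.GetBlock(round)
	if err != nil {
		return nil, fmt.Errorf("round %d not in light history: %w", round, err)
	}
	return blk, nil
}
```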
#6403 made me think I have to check the storage committee worker once more:
Update: false alarm.
^^ Tested / sanity-checked everything yesterday and I believe, if anything, the code is more robust now. Ready for the final review.
Closes #6352.
Considerations - #6355 (comment)