Fix slow/stuck unstaking due to toggling in epoch#13501
Fix slow/stuck unstaking due to toggling in epoch#13501CriesofCarrots merged 4 commits intosolana-labs:masterfrom
Conversation
| .stakes | ||
| .read() | ||
| .unwrap() | ||
| .clone_with_epoch(epoch, new.stake_program_v2_enabled()); |
There was a problem hiding this comment.
Actual clone()-ing moved here because I wanted to touch runtime feature (via .stake_program_v2_enabled()).
| .clone_with_epoch(epoch, new.stake_program_v2_enabled()); | ||
| *new.stakes.write().unwrap() = cloned; | ||
|
|
||
| let leader_schedule_epoch = epoch_schedule.get_leader_schedule_epoch(slot); |
There was a problem hiding this comment.
moved here because we should update anything related stakes after we update the Bank::stakes conditionally on the runtime feature activation.
| } else if fix_stake_deactivate && self.activation_epoch == self.deactivation_epoch { | ||
| // order important; and this correctly consider boostrap stakes first | ||
| (0, 0) |
There was a problem hiding this comment.
For this three lines, we need a 400+ lines pr. What a wonderful world.
| OldDeactivationBehavior::Slow, | ||
| false, | ||
| &[ | ||
| (0, 0, 0), |
There was a problem hiding this comment.
notice, previously we're doing right for the calculation of stake when it's the epoch we modified the stake account.
| false, | ||
| &[ | ||
| (0, 0, 0), | ||
| (13, 0, 13), |
There was a problem hiding this comment.
but, at the next epoch, we're incorrectly activating it by iterating the loop exactly once.
| &[ | ||
| (0, 0, 0), | ||
| (13, 0, 13), | ||
| (10, 0, 10), |
There was a problem hiding this comment.
And cooldown must be wrongly required.
|
todo:
|
Codecov Report
@@ Coverage Diff @@
## master #13501 +/- ##
========================================
Coverage 82.1% 82.1%
========================================
Files 378 378
Lines 90505 90619 +114
========================================
+ Hits 74317 74430 +113
- Misses 16188 16189 +1 |
There was a problem hiding this comment.
One nit. Otherwise this looks good to me.
The existing cli code is actually nice for testing; I ran solana stake-account and solana stakes against mainnet-beta to see the issue resolved :)
It'd be nice to get a preview of what will happen to MNB stake-history after this goes into effect (with no other activations/deactivations), since we've had that stuck-deactivating 15.31303968 SOL following us since epoch 17. Would that be easy to simulate in your local tests?
Pull request has been modified.
|
(for the record): local testing result Upon immediate onset of next epoch following feature activation, stuck deactivating stake disappears and stuck accounts suddenly chaged to deactivated |
In short, as this local testing shows, it only corrects future stake history entries newer than or equal to the epoch On top of stake history sysvar, this fix will also suddenly changes stake distribution and consequencial leader schedule of N+2. The inconsistent stake history indeed slightly continue to skew the current cluster-wide stake distribution if there are still activating stakes (should be verry small) since epoch 17 because we don't update intermediate epoch's stake history. Also this slightly affects all of deactivating stakes, because cluster-wide deactivating stake is suddenly will be reduced. Anyway, I think we can do nothing about this. Reconstructing stake history on-chain is now impossible and doing that off-chain and replacing with it will be very questionable practice. We can't retrospectively update older epoch entries because we can't re-construct older stake history entries only with current live account view. That's because some stakes account has been removed. I think that's why stake history exists in the first place. Anyway, the effect should be very small overall. So I don't think this is a real concern.
Also, I can simulate locally and I'll post the result |
|
base warped without fix warped with fix diff: diff --git a/bbb b/aaa
index d23742fdb6..afd0edf741 100644
--- a/bbb
+++ b/aaa
@@ -1,5 +1,5 @@
Epoch Effective Stake Activating Stake Deactivating Stake
- 111 163458814.134642482 5501268.70896064 516292.52201376 SOL
+ 111 163458798.821602821 5501268.70896064 516277.20897408 SOL
110 167401122.360376567 1558239.481008 5500563.01978176 SOL
109 169346111.643292427 1661779.5395936 3606784.13554912 SOL
108 169237091.558366656 160538.31356384 51533.54167776 SOL
for the record, anyone can easily test like (with https://github.com/ryoqun/solana/tree/slow-stuck-stake-deactivation-manual-testing): |
|
Also, it looks like all of affected stake accounts are delegating now inactive leader validators. I see no difference of stake distribution at present. |
Oh yes, I knew this would only affect future epochs. I was asking because having a non-zero |
68b2467 to
b583bce
Compare
| rpc_client | ||
| .get_account(&feature_set::stake_program_v2::id()) | ||
| .ok() | ||
| .and_then(|account| feature::from_account(&account)) | ||
| .and_then(|feature| feature.activated_at) | ||
| .is_some() |
There was a problem hiding this comment.
RpcClient::is_feature_enabled(feature_id) might be a useful method someday, but just have this specific use for now.
There was a problem hiding this comment.
@CriesofCarrots Oh, sorry for post merge comment, but, I think this might be better off checking the current slot of cluster. Otherwise, there is small time window where this returns too newer stake calculation behavior when the feature is pending to be activated for the next epoch.
There was a problem hiding this comment.
I believe this is correct as-is. The activation_at field in the account is changed from None to Some(slot) on the epoch boundary when the feature activates: https://github.com/solana-labs/solana/blob/master/runtime/src/bank.rs#L4119-L4122
When the feature activation has been requested, but is still pending,
Line 1895 in 9821a77
None, so the old stake calculation is used.
This is the same logic used here: https://github.com/solana-labs/solana/blob/master/cli/src/feature.rs#L298
There was a problem hiding this comment.
oh yeah. you're right! Sorry for misguiding you... I misunderstood the code. It seems that my in-brain runtime feature activation code got corrupted somehow... ;)
b583bce to
2631caa
Compare
CriesofCarrots
left a comment
There was a problem hiding this comment.
This seems good to me
* Fix slow/stuck unstaking due to toggling in epoch * nits * nits * Add stake_program_v2 feature status check to cli Co-authored-by: Tyera Eulberg <tyera@solana.com> (cherry picked from commit 89b474e)
* Fix slow/stuck unstaking due to toggling in epoch * nits * nits * Add stake_program_v2 feature status check to cli Co-authored-by: Tyera Eulberg <tyera@solana.com> (cherry picked from commit 89b474e)
* Fix slow/stuck unstaking due to toggling in epoch * nits * nits * Add stake_program_v2 feature status check to cli Co-authored-by: Tyera Eulberg <tyera@solana.com> (cherry picked from commit 89b474e) Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
* Fix slow/stuck unstaking due to toggling in epoch (#13501) * Fix slow/stuck unstaking due to toggling in epoch * nits * nits * Add stake_program_v2 feature status check to cli Co-authored-by: Tyera Eulberg <tyera@solana.com> (cherry picked from commit 89b474e) * Fix conflict * PartialEq<Vec<T>> is not impl for &[T] on rust v1.45.1 Co-authored-by: Ryo Onodera <ryoqun@gmail.com> Co-authored-by: Tyera Eulberg <tyera@solana.com>
|
Thanks for merging this and fixing the cli part! left some comment here: #13501 (comment) |
Problem
if activation_epoch == deactivation_epoch, there should be no need for cooldown. But we incorrectly require it not at the epoch, but at the following epochs. And if stake history says no other deactivating stakes, the cooldown could be stuck. Otherwise (most of time), we incorrectly cool-downs the quickly-deactivated stake account very slowly.
For details, please see the attached test.
Summary of Changes
We need to ensure to return
(0, 0)(= no effective stake, no activating stake) for stake account activated-then-deactivated in a single epoch.Special gating required because of duality of codepath when calculating stakes: instruction processor and normal runtime:
stake_and_activating(..., false);stake_and_activating(..., true);stake_and_activating(..., flag)whereflagis conditioned by runtime feature.Fixes #10982
Context: #13460 #13471.