Radically simplify staking logic #2312

cwgoes · 2018-09-12T06:52:16Z

Many of the state machine bugs we've found so far are in the staking module, particularly the UpdateValidators iteration cases and the cliff validator optimization logic. The stake keeper (x/stake/keeper) is generally difficult to understand and reason through - as demonstrated by our inability to catch these issues when writing the code or in review. In part this may be due to writing the code in stages and changing the underlying model several times.

Let's first discuss the clearest layout - maintaining the current performance (which is fine) but splitting up the logic much more clearly between self-contained functions with well-defined pre-conditions and post-conditions (most of which ideally can be defensively asserted). In parallel with this we should make sure to compare our code to the staking/slashing spec, which may have fallen a bit behind the implementation in a few places.

cc @rigelrozanski @alexanderbez @ValarDragon

Ref #2298 (comment)
Ref general discussion in #2173

The text was updated successfully, but these errors were encountered:

ValarDragon · 2018-09-12T21:09:51Z

Some ideas for main ideas to classify by:
Cliff validator exists or not?
Is the current validator new or not?

In the case of a new validator, we have a separate function for "pushing the cliff validator down", and etc. I think alot of the bugs come through because the functions aren't small things that are easy to verify independently.

rigelrozanski · 2018-09-12T22:00:19Z

All for it - I think the conceptualization of staking has expanded dramatically several times - many times we've tended to tack on new thinking without refactoring..
looks like it's time to clean out the freezer so to speak - take out all the frozen food, reorganize and repackage it, then put it back in.

Misc thoughts:

I don't think the staking logic itself can be "radically simplified" - the logic by nature is quite complex. However the logic can be made radically more comprehensible I reckon - and organization of the logic radically simplified.
Many of the existing functions have poorly specified pre-conditions in the code comments (def an easy place to being improvement).
Honestly, I think the biggest bugs we've been encountering is with regards to the cliff validator logic - and how it reacts on a variety of edge cases - this should probably be of primary focus. Thinking outside of the box this may be the time to completely reevaluate how we conceptualize what a cliff validator is - and potentially respecify what and how cliff logic should work. I think there is probably an alternative model to the current cliff model which changes the staking dynamics, but drastically simplifies the code. For example if we didn't care so much about kicking folks out immediately, we get rid of the cliff validator object entirely - and passively iterate through the tail of the bonded validator set and kick validators out in a lagged fashion (which may mean than once in a while the validator set is slightly greater than 100 but will be corrected in the next few blocks)

^ this is not a fully conceptualized idea, however I think something a little bit more lazy and along these lines may help A LOT! - what y'all think??

We should setup a call probs

Side thought:
This may be an opportunity to think about slow-staking module may be a good way to iron out bugs.

ValarDragon · 2018-09-12T22:08:23Z

I agree, perhaps we can change alot of the "cliff validator" paradigm. Its unclear to me why the following categorization doesn't suffice:

"bonded", "unbonded", "jailed".

I don't see a need for the cliff validator abstraction, as long as we keep "bonded" and "unbonded" sorted by power, which shouldn't be expensive. Inserting into a sorted list is logarithmic worst case, if we want to do some fancy predictive stuff #postlaunch, we may even be able to get it to log(log(n)). (Basically rob the idea from https://en.wikipedia.org/wiki/Interpolation_search, and come up with a decent predictor for location within the sorted list)

rigelrozanski · 2018-09-12T22:17:48Z

(side point we also need to have "unbonding" in that list - important for slashing right)

Yeah there is definitely a way to not have the cliff validator term at all without changing much, I think it just means that we need to have a bunch more iterating - which may be okay, maybe a much of the inefficiently that may come from having to iterate over a bunch of records (only retrieve one of them) can be solved on the store side of things. For instance, As far as I can tell there is not easy way to retrieve the "100th" record from the store without iterating over it, but I bet that's a piece of functionality which wouldn't be to difficult to build into the store.

but yeah maybe we leave this as a store optimization and kill cliff validator's ASAP - seems like it's just causing no end of headaches

alexanderbez · 2018-09-12T22:57:55Z

During the past few weeks of debugging this code, I can personally say that iteration and the notion of cliff validators are the biggest pain points for debugging. I found the following troublesome when debugging

When/how bondHeight/bondIntraTxCounter are set
(Reverse) Iteration of validators by power and who to unbond, who to bond, who to remove as cliff, and who to set as cliff, etc...

rigelrozanski · 2018-09-13T16:58:17Z

Ideas from the meeting:

The cases within the switch statement are hard to navigate
It could be good to seperate switch into a Classification and Execution code set
(spec Split up UpdateValidator per each state-transitions, & Tendermint updates in Endblock #2351) Cliff, move cliff validator to the end of the block (we need a proposal)
Obviously more distinct functions
(spec Split up UpdateValidator per each state-transitions, & Tendermint updates in Endblock #2351) Maybe combine Validator.Status, Validator.Jailed - (need to introduce things like Validator.IsUnbonding()) - then we can write everything as a state transition in update validator (really cool!!) - much better organization - Maybe this would allow us to eradicate update validator all together (only if this change is combined with use of end-block bonded-validator updates)
(spec/implementation @ValarDragon) Replace the IntraTx counter - GlobalMessageCounter which comes from the context (or maybe IntraMessageCounter from context -TBD) if we had a global counter then we could remove block-height from the Validator object (bonus)
No action necessary slashing/staking split is good
Instead of having a separate storage for cliff validator, instead just make it access the nth element of the store
#postlaunch consider using a cache BST instead of a cache KV store (which is a map)

ValarDragon · 2018-09-14T04:12:56Z

Its also interesting to note, that doing the lazy evaluation at end block will probably speed up execution under periods of high load. This is because if there are multiple validator set updates, you only have to do one iteration through it. Whereas before the number of iterations may not have been well bounded (e.g. 1 per message in the block), but in the simplest case there wouldn't be any iterations.

However, we can easily add a boolean "was there a stake related update this block", and only run through that iteration accordingly. (This is an easy #if-have-time-prelaunch optimization though)

rigelrozanski · 2018-09-14T04:33:21Z

@ValarDragon - great idea to add an optimization bool for if there was an update or not

ebuchman · 2018-09-18T17:51:55Z

Can we consider removing redelegation as part of this? It seems like pretty clearly non-MVP requirement and just adds complexity.

sunnya97 · 2018-09-18T18:05:51Z

I think instant redelegation is a requirement for having a proper decentralized Proof of Stake. Without it, delegation becomes so sticky and the initial validator set gets entrenched.

We've also publicly stated on multiple occasions that this is one of the defining unique features of our Proof of Stake implementation.

rigelrozanski · 2018-09-18T19:31:38Z

I'm going to second @sunnya97 for this, not to mention it's actually a fairly straightforward implementation, which doesn't need to change is already complete - there are very few edge cases. I think by removing and adding it back in later, we are going to generate more work for ourselves in pre1.0 phase - and I'm not convinced it will save us much time prelaunch (not to mention weaken our security model as sunny pointed out)

cwgoes · 2018-10-09T17:33:21Z

Most has been done, closing in favor of the remaining #1402.

…osmos#2312)

cwgoes added C:x/staking prelaunch C:x/slashing labels Sep 12, 2018

cwgoes mentioned this issue Sep 12, 2018

R4R: Ensure Tendermint Validator Update Invariants - Part II #2298

Merged

5 tasks

cwgoes added the S:proposed label Sep 12, 2018

alexanderbez mentioned this issue Sep 12, 2018

Randomized Fuzzer: cliff validator has not been changed, yet we bonded a new validator #2289

Closed

rigelrozanski mentioned this issue Sep 13, 2018

x/stake: Add another function for UpdateValidator if cliff validator doesn't exist #2293

Closed

alexanderbez mentioned this issue Sep 13, 2018

WIP: Fix Cliff Val Bug Caught in Simulation #2332

Closed

7 tasks

ValarDragon mentioned this issue Sep 13, 2018

ADR for global message counter #2334

Merged

This was referenced Sep 17, 2018

R4R: simulation: reduce number of failed governance msgs #2299

Merged

Split up UpdateValidator per each state-transitions, & Tendermint updates in Endblock #2351

Closed

ValarDragon added S:proposal accepted and removed S:proposed labels Sep 20, 2018

rigelrozanski mentioned this issue Sep 25, 2018

R4R: Split up UpdateValidator into distinct state transitions applied only in EndBlock #2394

Merged

5 tasks

alexanderbez added game-of-steaks and removed prelaunch labels Sep 25, 2018

alexanderbez mentioned this issue Sep 25, 2018

Randomized Fuzzer: should not already be unbonded or unbonding, validator: #2290

Closed

This was referenced Sep 25, 2018

Gaiad network crash in executing redelegation transactions #2241

Closed

Randomized fuzzer: expected to find slashing period, but none was found #2225

Closed

cwgoes closed this as completed Oct 9, 2018

chillyvee pushed a commit to chillyvee/cosmos-sdk that referenced this issue Mar 1, 2024

docs: keep the code guideline the same as test upgrade in workflow (c…

82a3b08

…osmos#2312)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Radically simplify staking logic #2312

Radically simplify staking logic #2312

cwgoes commented Sep 12, 2018 •

edited

Loading

ValarDragon commented Sep 12, 2018 •

edited

Loading

rigelrozanski commented Sep 12, 2018

ValarDragon commented Sep 12, 2018 •

edited

Loading

rigelrozanski commented Sep 12, 2018

alexanderbez commented Sep 12, 2018

rigelrozanski commented Sep 13, 2018 •

edited

Loading

ValarDragon commented Sep 14, 2018 •

edited

Loading

rigelrozanski commented Sep 14, 2018

ebuchman commented Sep 18, 2018

sunnya97 commented Sep 18, 2018 •

edited

Loading

rigelrozanski commented Sep 18, 2018 •

edited

Loading

cwgoes commented Oct 9, 2018

Radically simplify staking logic #2312

Radically simplify staking logic #2312

Comments

cwgoes commented Sep 12, 2018 • edited Loading

ValarDragon commented Sep 12, 2018 • edited Loading

rigelrozanski commented Sep 12, 2018

ValarDragon commented Sep 12, 2018 • edited Loading

rigelrozanski commented Sep 12, 2018

alexanderbez commented Sep 12, 2018

rigelrozanski commented Sep 13, 2018 • edited Loading

ValarDragon commented Sep 14, 2018 • edited Loading

rigelrozanski commented Sep 14, 2018

ebuchman commented Sep 18, 2018

sunnya97 commented Sep 18, 2018 • edited Loading

rigelrozanski commented Sep 18, 2018 • edited Loading

cwgoes commented Oct 9, 2018

cwgoes commented Sep 12, 2018 •

edited

Loading

ValarDragon commented Sep 12, 2018 •

edited

Loading

ValarDragon commented Sep 12, 2018 •

edited

Loading

rigelrozanski commented Sep 13, 2018 •

edited

Loading

ValarDragon commented Sep 14, 2018 •

edited

Loading

sunnya97 commented Sep 18, 2018 •

edited

Loading

rigelrozanski commented Sep 18, 2018 •

edited

Loading