Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear the global session cache when a node is unjailed #1529

Merged
merged 2 commits into from
Apr 6, 2023
Merged

Clear the global session cache when a node is unjailed #1529

merged 2 commits into from
Apr 6, 2023

Conversation

msmania
Copy link
Contributor

@msmania msmania commented Mar 1, 2023

This is a fix for a consensus failure with wrong Block.Header.AppHash, which is probably the same issue as #1478.

Issue description:

For a node to be selected for a session, it must be valid not only at the start of a session but also at the end of a session. If a node is edit-staked, jailed, or unjailed in the middle of a session, a servicer set of the session is changed.

When a node serves relays, on the other hand, it stores session information into the global session cache. If the session's servicer set is changed, the node needs to update the cached session. Otherwise the node accepts a claim that should be rejected, or rejects a claim that should be accepted. As a result, the node ends up with a consensus failure with wrong Block.Header.AppHash.

Root cause and Fix:

The root cause is simple. We have a call to ClearSessionCache in EditStakeValidator, LegacyForceValidatorUnstake, ForceValidatorUnstake, and JailValidator, but not in UnjailValidator. The proposed fix is to add the call in UnjailValidator as well.

@Olshansk Olshansk added the bug Something isn't working label Mar 7, 2023
@Olshansk Olshansk self-requested a review March 7, 2023 22:37
@Olshansk
Copy link
Member

Olshansk commented Mar 7, 2023

Thanks @msmania. Logically, this makes sense to me and I've confirmed that we call ClearSessionCache in all other relevant locations.

My only concern is that I don't have the resources right now to validate this and the testing coverage isn't good enough either.

Ccing @PoktBlade for a 2nd opinion if this is good to just 🚢 ?

@nodiesBlade
Copy link
Contributor

nodiesBlade commented Mar 8, 2023

Good find, you are correct that we should be clearing the session cache when unjailing, and this could cause consensus issues due to caching causing nondeterministic behavior.

Most validators and servicers should not have the session cached until the claim/proof lifecycle. They would only if the validator/servicer is in the same app session.

Furthermore, sessions are cleared in general based on certain transactions. TLDR - we're riding off luck with nondeterministic behavior

Where does the risk lie?:
Given LeanPocket allows nodes to be stacked onto each other, validators will share the same session cache, and such could result in a consensus failure with enough luck if 1/n stacked validators are paired with the jailed servicer that submits a claim. This is because then all stacked validators have the 'wrong' session cached, and unless there is another transaction that clears the session cache, the stacked validators will likely all disagree with the other validators in the network.

Even without LP, the probability of it happening is non-zero.

I am unsure if we should merge this change, or if we should consider a consensus-breaking change for this otherwise we will have some validators clearing the cache and some validators not clearing the cache, which could cause more issues.

@nodiesBlade
Copy link
Contributor

If we want to play it 100% safe, then an upgrade height would be ideal, and release it ASAP.

@nodiesBlade
Copy link
Contributor

I don't mean to distract this conversation, but I added a sample implementation here:

#1532

It compiles, that's as far as I've tested it.

@msmania msmania mentioned this pull request Apr 3, 2023
@reviewpad reviewpad bot added small Pull request is small waiting-for-review labels Apr 4, 2023
Copy link
Member

@Olshansk Olshansk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few comments to the PR in #1532 that has the flag. Do you mind copying those changes into this PR?

@reviewpad reviewpad bot requested a review from okdas April 4, 2023 22:52
@reviewpad
Copy link

reviewpad bot commented Apr 5, 2023

AI-Generated Pull Request Summary: This pull request includes two patches:

  1. Clear the global session cache when a node is unjailed. This change ensures the cache is cleared whenever a validator is unjailed to prevent any non-deterministic cache consistency issues.

  2. Porting the upgrade key from Placeholder for clearing session cache for unjailed vals with upgrade height #1532. This patch introduces a new feature key, "CLSES", and adds more test cases. It also includes checking tolerance on the upgrade height which helps to enable certain business logic within some tolerance of feature activation, providing more confidence in the feature's release and avoids non-deterministic or hard-to-predict behavior.

@reviewpad reviewpad bot added medium Pull request is medium and removed small Pull request is small labels Apr 5, 2023
@reviewpad
Copy link

reviewpad bot commented Apr 5, 2023

AI-Generated Pull Request Summary: This pull request includes two patches:

  1. Clear the global session cache when a node is unjailed: This patch modifies the UnjailValidator function to clear the global session cache when a node is unjailed. It also updates the comments and adds a feature flag to conditionally enable the cache clearing after a specific upgrade height.

  2. Porting the upgrade key from Placeholder for clearing session cache for unjailed vals with upgrade height #1532: This patch adds a new function IsOnNamedFeatureActivationHeightWithTolerance to the Codec that checks whether a specific feature is activated within a given height tolerance. Additionally, the patch introduces the ClearUnjailedValSessionKey constant, adds a test case for the new function, and updates the BeginBlock function in the gov module to clear the global session cache only for a few blocks around the upgrade height. The changes are made across multiple files, including codec.go, codec_test.go, module.go, and valStateChanges.go.

"github.com/pokt-network/pocket-core/crypto"
exported2 "github.com/pokt-network/pocket-core/x/apps/exported"
"github.com/pokt-network/pocket-core/x/auth"
"github.com/pokt-network/pocket-core/x/gov"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this change?
x/gov/module.go importing x/pocketcore/types caused a cyclic import in the x/pocketcore/types package because this file imported x/gov.

It seems we don't need makeTestCodec() at all. I verified all tests under x/pocketcore/types and x/pocketcore/keeper passed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me

@msmania
Copy link
Contributor Author

msmania commented Apr 5, 2023

I added a few comments to the PR in #1532 that has the flag. Do you mind copying those changes into this PR?

Done!

@msmania msmania requested a review from Olshansk April 5, 2023 19:40
@msmania msmania removed the request for review from okdas April 6, 2023 21:39
@jessicadaugherty jessicadaugherty merged commit 12d66de into pokt-network:staging Apr 6, 2023
@msmania msmania deleted the bugfix-wrong-apphash branch April 6, 2023 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working medium Pull request is medium waiting-for-review
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants