-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Session roll over fix #1536
Session roll over fix #1536
Conversation
Newly added tests
|
2cc748e
to
67cfa85
Compare
Note: some nodes will seal their evidence claim immediately after sesson rollover and this will result in evidence sealed errors. In order to remediate this, we could modify claim and proof tx to only occur on 2,3,4th of a session block instead depending on the nodes address offset. incorporating this network wide will require more due diligence as the reason why this code exists in first place is to prevent too many claims/proofs from being submitted at once in a single block |
@PoktBlade Thanks for doing this! I haven't reviewed the code but just wanted to ACK that I'm aware and it's on my backlog.
I, unfortunately, think that testing and getting confidence in this change is going to be one of the hardest parts, aside from the social coordination of deploying it. I have two requests:
|
Hey, I have set up my own localnet stack from rigging up my own E2E stack. So I can test to see if session roll over works. the e2e automatic test suite is new to me, I don't have the bandwidth to test that. It was new and delivered abruptly so no one besides pni has context of that. The previous testing mechanisms included a QA suite of test cases and then a manual regression test using said qa test suite. I can add some scenarios to that, and someone can translate it to the automatic version. |
tl;dr Noted, I will put the time into this myself.
I tried to put together good & easy-to-follow documentation (so it can be used by others), but if you can share the results of your own stack (potentially a video and/or documentation), that would be good too. Ref: https://github.com/pokt-network/pocket-core/blob/fixing_localnet/localnet.md
Understood. I tried to make the documentation easy to follow but will invest time in extending, improving and testing it. https://github.com/pokt-network/pocket-core/tree/staging/e2e |
Hey @Olshansk I responded to most of your comments - thanks for taking the time to review it. I think the largest I concern was that you mentioned one of the tests were not running for you after reverting the major core piece of this branch, indicating something might be wrong. Can you provide your test results and we can take it from there, thank you! |
Tests should not fail under the current designs. ExplanationSo the test scenario is mainly bootstrapped by a function Change madeIn your scenario, you revert my changes, so the network will always grab the latest session block height dictated by Tests:Test One (Simple handle relay):
Test Two (Within tolernace):
Test Three (outside tolernace):
pls explainNow you might be asking, what's up with these random/magic numbers in these tests, and why are the tests structured this way. Unfortunately, this is just tech debt within the code base already, and I'm happy to resolve as much as I can and provide clarity as you see fit. What's more important to me is that you understand the code that I'm writing and you agree that the logic works. Everything else, I prefer to be a bit more pragmatic and timely on how to solve these PRs. I'll solve as much as I can in the same PR and will suggest moving the unresolved ones to another GH issue/PR. Can you mark the ones as blockers, and the ones as nice to have, and I will prioritize as needed. P.S - I used delve to debug to see the values on the heap/stack and the function calls trace, good tool to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's more important to me is that you understand the code that I'm writing and you agree that the logic works.
I spent several hours yesterday looking at this and string to understand it, but I personally still don't have confidence that this works.
What I'd prefer is something similar to what @msmania has in #1534.
- Add feature switch
- Show failing test with feature switch off
- Turn on feature switch
- Show successful test with feature switch on
Now you might be asking, what's up with these random/magic numbers in these tests, and why are the tests structured this way. Unfortunately, this is just tech debt within the code base already, and I'm happy to resolve as much as I can and provide clarity as you see fit.
Sorry about that. I'm having a hard time wrapping my head around it, which is why I'm asking for the renaming of certain functions/variables and adding comments where appropriate.
Can you mark the ones as blockers, and the ones as nice to have, and I will prioritize as needed.
I added follow-up comments and re-reviewed the PR.
- Please take a look at the my comments and re-request a review once it's ready for another look.
- If you add me as an editor on your fork (or move the branch to the main repo), I can help make some of the edits myself on this branch.
Hey @Olshansk Addressed most of your comments, try giving it a read again with the test cases, it should provide more clarity for your reading. If it takes too long, just message me async and I'll do my best to explain the logic. |
Also, you mention we should have a feature flag.. Just a couple notes to that:
|
Sorry for interrupting the ongoing discussion. If I understand this patch correctly, the change in |
I would deem it necessary as if we are going to allow for “sessions to be rolled over” then we should allow the node to dispatch for the old sessions as well. |
Session role over can be achieved by the change in When someone, basically the Portal, starts a relay, it should always use the latest session and shouldn't request a relay with an old session because app or node's staking status may have been changed. And the Portal indeed does that by setting What your change in Moreover, you need to be careful about the second argument of |
Fair enough. Removed it. Portals who want to dispatch can later send in a PR or modify their own PCC to dispatch for an old session. |
Thanks! One more ask about test code. In your test, each testcase calls I think the test should be something like this. With this, one setup covers all scenarios and magic numbers are removed completely. func TestKeeper_HandleRelay(t *testing.T) {
setupHandleRelayTest(...)
blockPerSession := keeper.BlocksPerSession(ctx)
allowance = some positive value
types.GlobalPocketConfig.ClientSessionBlockSyncAllowance = 0
// Send relay with older heights without session roll over
for i := 0; i < blockPerSession*(allowance+1); i++ {
height := ctx.BlockHeight() - i
// create a relay payload with `height`
resp, err := keeper.HandleRelay(mockCtx, validRelay)
// verify the result accordingly
}
types.GlobalPocketConfig.ClientSessionBlockSyncAllowance = allowance
// Send relay with older heights with session roll over
for i := 0; i < blockPerSession*(allowance+1); i++ {
height := ctx.BlockHeight() - i
// create a relay payload with `height`
resp, err := keeper.HandleRelay(mockCtx, validRelay)
// verify the result accordingly
}
} |
If we were using something like Ginkgo to structure our tests, I'd agree. Otherwise, having the tests separately helps isolate the environments and what we're testing for since these are unit tests and not integration / e2e tests.
Also agree, but this would require more restructuring. I would push for a follow-up PR to improve tests if we really want to sharpen this old cold base (not worth it for this use case IMO), my bandwidth is quite limited atm to do more refactoring. If it's an absolute must, then please let me know and I'll find time to do it as I'd like see pokt continue to improve. (Or if you'd like to set up these changes, that'd be great too) |
/reviewpad summarize |
AI-Generated Pull Request Summary: This pull request makes several changes to the configuration file, including adding a new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PoktBlade A few more minor comments / NITs + I asked to leave a TODO per @msmania's suggestion so him or I can do it in a separate PR.
Two questions:
- Does the following diagram capture (visually) the changes made in a simple way in your opinion? Lmk if I should change anything
sequenceDiagram
title 4 Blocks Per Session
participant C as Client
participant N as Node(h=100)
alt 1. before change - ok
C->>+N: Relay(100<=h<=103)
N->>-C: Response + Reward
else 2. before change - err
C->>+N: Relay(h<100 || h > 103)
N->>-C: Err
else 3. after change - ok
C->>+N: Relay(100<=h<=103)
N->>-C: Response + Reward
C->>+N: Relay(96<=h<=99)
N->>-C: Response + ?? NoReward??
else 4. before change - err
C->>+N: Relay(h<96 || h > 103)
N->>-C: Err
end
-
Am I correct to understand that in case (3) above, the servicer won't be rewarded for serving a relay from a previous session
-
I was looking at
x/pocketcore/types/service.go
and trying to understand why we don't need to change the relay meta validation. Could you help me understand this?
func (m RelayMeta) Validate(ctx sdk.Ctx) sdk.Error {
// ensures the block height is within the acceptable range
if ctx.BlockHeight()+int64(GlobalPocketConfig.ClientBlockSyncAllowance) < m.BlockHeight || ctx.BlockHeight()-int64(GlobalPocketConfig.ClientBlockSyncAllowance) > m.BlockHeight {
return NewOutOfSyncRequestError(ModuleName)
}
return nil
}
x/pocketcore/keeper/session.go
Outdated
@@ -55,6 +55,18 @@ func (k Keeper) IsSessionBlock(ctx sdk.Ctx) bool { | |||
return ctx.BlockHeight()%k.posKeeper.BlocksPerSession(ctx) == 1 | |||
} | |||
|
|||
// IsProofSessionHeightWithinTolerance checks if the latest session block height is bounded by minSessionBlockHeight <= x <= latestBlockHeight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// IsProofSessionHeightWithinTolerance checks if the latest session block height is bounded by minSessionBlockHeight <= x <= latestBlockHeight | |
// IsProofSessionHeightWithinTolerance checks if the relaySessionBlockHeight is bounded by (latestSessionBlockHeight - tolerance ) <= x <= latestBlockHeight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've adjusted this to the max range to be latestSessionHeight
which is what my tests reflects as well. There is no reason to believe that the relay session block height would ever be above latest session block height.
if relay proof session block height is above latest session block height, then the app sent in a wrong app session height or the app dispatcher is ahead of the servicer node, which should result in an error
75a60d6
to
eef50cf
Compare
AI-Generated Summary: This pull request introduces a new configuration parameter |
AI-Generated Summary: This pull request introduces a new configuration option named |
Squashed all my initial commits to rebase easily
I believe this is good to merge in. @Olshansk, merge in if you think so as well! |
I'd move |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some nits related to comments. Will approve after.
Unrelated and optional: @PoktBlade, sharing personal experience: Rebasing has worked well for me in other environments (e.g. big companies) and platforms (e.g. gerrit). But I've found that because of a lot of factors (OSS, github, PR flow, squash & merging to main, etc...), I just always `"merge with main" and resolve conflicts.
AI-Generated Summary: This pull request introduces a new configuration property |
Resolved the nits, good to go again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your patience!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @PoktBlade. Feel free to squash & merge into staging
!
I'll also take a stab at documenting all of the great discussions & notes we have here and will send your way to review by EOW. 🙏
With session rollover (#1536), we accept relays for older sessions. For exaple, a node at block 101 accepts a relay for the session height 97. There is a bug in the relay validation function `relay.Validate`. In the example above, the function sees the following values: ``` ctx.BlockHeight() = 101 sessionBlockHeight = r.Proof.SessionBlockHeight = 97 sessionCtx.BlockHeight() = 97 ``` and if the corresponding session is not cached, it passes `sessionCtx` and `ctx` to `NewSession`. This may return a wrong session because the second argument is supposed to be the end of the session, but in this case it's not. The proposed fix is if `ctx` is on the next session, we get a new context of the session end and pass it to `NewSession`.
With session rollover (#1536), we accept relays for older sessions. For exaple, a node at block 101 accepts a relay for the session height 97. There is a bug in the relay validation function `relay.Validate`. In the example above, the function sees the following values: ``` ctx.BlockHeight() = 101 sessionBlockHeight = r.Proof.SessionBlockHeight = 97 sessionCtx.BlockHeight() = 97 ``` and if the corresponding session is not cached, it passes `sessionCtx` and `ctx` to `NewSession`. This may return a wrong session because the second argument is supposed to be the end of the session, but in this case it's not. The proposed fix is if `ctx` is beyond the relay session, we get a new context of the session end and pass it to `NewSession`.
With session rollover (#1536), we accept relays for older sessions. For exaple, a node at block 101 accepts a relay for the session height 97. There is a bug in the relay validation function `relay.Validate`. In the example above, the function sees the following values: ``` ctx.BlockHeight() = 101 sessionBlockHeight = r.Proof.SessionBlockHeight = 97 sessionCtx.BlockHeight() = 97 ``` and if the corresponding session is not cached, it passes `sessionCtx` and `ctx` to `NewSession`. This may return a wrong session because the second argument is supposed to be the end of the session, but in this case it's not. The proposed fix is if `ctx` is beyond the relay session, we get a new context of the session end and pass it to `NewSession`.
This pull request has been mentioned on Pocket Network Forum. There might be relevant details there: https://forum.pokt.network/t/into-the-gateway-verse-bootstrapping-a-multi-gateway-ecosystem/4532/1 |
## Description This PR sets the default session sync allowance back to zero, effectively removing the session rollover fix as a default setting. Clients can re-enable as they see fit. Gateways should code logic that accounts for old sessions. Gateways should also always use the latest session when possible. PNI recently transitioned from AWS to GCP, and their Portal V1 stack to V2 stack. This has resulted in 1.3B to ~600m relays being submitted on the chain. Our fleet of 4K+ nodes has also seen reduced served traffic significantly. The rationale behind this change is simple: Too many changes all happening at once. This change is simply to prevent additional noise from the ongoing incident. Once PNI's V2 gateway is stable and network traffic is stable, we can set this default back to `1` with a new build given its nonconsensus changing. NOTE: This is to prevent any more damage with the ongoing incident and cause additional confusion for the gateway and node runners. Since we don't know what the root cause is for the missing traffic, by not including it by default, it removes one less factor for them to consider. Relevant PR: #1536 <!-- reviewpad:summarize:start --> ### Summary generated by Reviewpad on 27 Jun 23 04:41 UTC This pull request updates the default session sync allowance configuration from 1 to 0, representing a session unit (irrespective of num blocks per session). This ensures that the session sync allowance configuration is more adjustable for user's needs. <!-- reviewpad:summarize:end -->
Addresses #1464 by adding a configurable session block tolerance into pocket config
client_session_sync_allowance
with a default value of1
This means that once a new session appears, servicer nodes will still respect old sessions up until 1 session block (4 blocks at the time of this PR).
DISCLAIMER: SHOULD DEFINITELY TEST.......................... USING THE REGULAR RELEASE CYCLE MECHANISMS