Refactor chain following #2750
It will be good to simplify chain following.
lib/core/src/Cardano/Wallet/DB.hs
Outdated
@@ -329,6 +331,8 @@ newtype ErrNoSuchWallet
    = ErrNoSuchWallet WalletId -- Wallet is gone or doesn't exist yet
    deriving (Eq, Show)

instance Exception ErrNoSuchWallet
Allowing both checked and unchecked exceptions here is something we want to avoid.
    -- TODO: Recover on connection lost exceptions!
    connectClient tr handlers client versionData conn

    , currentNodeTip =
        fromTip getGenesisBlockHash <$> atomically readNodeTip
    , currentNodeEra =
        -- NOTE: Is not guaranteed to be consistent with @currentNodeTip@
        readCurrentNodeEra
    , watchNodeTip = do
With the reformulation of chainSync, this function will probably be redundant.
I don't see how watchNodeTip would be related to this PR.
We use it for
- Caching reward balances in DB
- Tx resubmission
It would be good to unify the DB and in-memory caching of rewards, but that's a separate concern. (I think this might contribute to values being slower to be updated, and perhaps to some integration test flakiness).
I guess you mean we could implement watchNodeTip using chainSync?
But I think
- having a single tip-sync protocol and multiple STM observers (allow callbacks to be skipped under load)
- not having to deserialise block bodies
are good qualities with the current approach, plus I don't see how ChainFollower could allow fast-forwarding the intersection to the tip.
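To make the "STM observers can skip callbacks under load" point concrete, here is a minimal sketch. It is not the actual watchNodeTip implementation: the helper awaitNewTip and the use of Int as a tip type are assumptions made purely for illustration. An observer blocks until the shared tip differs from the last one it saw, so intermediate tips published while it was busy are never delivered to it.

```haskell
import Control.Concurrent.STM
    (TVar, atomically, newTVarIO, readTVar, retry, writeTVar)

-- Hypothetical helper: block until the shared tip differs from the
-- last tip this observer has seen, then return the newest value.
awaitNewTip :: Eq tip => TVar tip -> tip -> IO tip
awaitNewTip var lastSeen = atomically $ do
    tip <- readTVar var
    if tip == lastSeen then retry else pure tip

-- The producer publishes three tips before the observer looks again.
-- The observer only ever sees the latest one; tips 1 and 2 are skipped.
demoTip :: IO Int
demoTip = do
    var <- newTVarIO (0 :: Int)
    mapM_ (atomically . writeTVar var) [1, 2, 3]
    awaitNewTip var 0
```

Any number of observers can call awaitNewTip on the same TVar, each with its own notion of the last seen tip, while only one protocol client writes to it.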
Closing for now.
Let's keep the branch and re-open this PR later, because this refactor is something that we definitely need.
bors try
try
Build succeeded:
The integration test failures seem to be unrelated to the new chain following code: on my local machine, the failing tests pass. Instead, the failures seem to be related to:
It appears that these problems will have to be addressed in the database layer.
From an incomplete first pass now, as well as some commits I saw in the past: LGTM! I left some comments, but will have another look.
Regardless, maybe we could have a ~20 minute call about it tomorrow, some time after the futurespective?
@@ -505,7 +520,7 @@ localStateQuery queue =
         :: LocalStateQueryCmd block m
         -> m (LSQ.ClientStAcquired block (Point block) (Query block) m Void)
     clientStAcquired (SomeLSQ cmd respond) = pure $ go cmd $ \res -> do
-        LSQ.SendMsgRelease (respond res >> clientStIdle)
+        LSQ.SendMsgRelease (respond res >> finalizeCmd >> clientStIdle)
I wonder whether, technically, a lost connection between respond res and finalizeCmd could lead to the callback being called twice.
But this should be exceedingly rare, and it shouldn't affect queries using our send helper:
send
    :: MonadSTM m
    => TQueue m (cmd m)
    -> ((a -> m ()) -> cmd m)
    -> m a
send queue cmd = do
    tvar <- newEmptyTMVarIO
    atomically $ writeTQueue queue (cmd (atomically . putTMVar tvar))
    atomically $ takeTMVar tvar
and I don't imagine it would affect anything else either.
If you concur, maybe we should add a note about it though.
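For readers unfamiliar with the pattern, the send helper above can be tried out in plain IO. This is only a sketch: it specialises send to IO using the stm package instead of MonadSTM, and the Cmd type, worker and demoSend are hypothetical names invented for the example, not part of the wallet code.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
    ( TQueue, atomically, newEmptyTMVarIO, newTQueueIO
    , putTMVar, readTQueue, takeTMVar, writeTQueue )
import Control.Monad (forever)

-- Hypothetical command: an input plus a callback receiving the answer.
data Cmd = DoubleIt Int (Int -> IO ())

-- Same shape as the helper quoted above, specialised to IO.
send :: TQueue cmd -> ((a -> IO ()) -> cmd) -> IO a
send queue cmd = do
    tvar <- newEmptyTMVarIO
    atomically $ writeTQueue queue (cmd (atomically . putTMVar tvar))
    atomically $ takeTMVar tvar

-- Hypothetical stand-in for the protocol client: take a command off
-- the queue, compute the result, and invoke the callback.
worker :: TQueue Cmd -> IO ()
worker queue = forever $ do
    DoubleIt n respond <- atomically (readTQueue queue)
    respond (2 * n)

demoSend :: IO Int
demoSend = do
    queue <- newTQueueIO
    _ <- forkIO (worker queue)
    send queue (DoubleIt 21)
```

The caller blocks on the TMVar until the worker invokes the callback, which is why the timing of respond relative to finalizeCmd matters in the discussion above.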
Hm, I do not believe that there is a race condition here. 🤔 As far as I can tell, losing the connection to a node is a synchronous exception: The next read or write to the underlying socket will fail. As long as we do not read or write to the socket, there will be no exception related to it.
If this code path receives an asynchronous exception, then this will not be caught by recoveringNodeConnection, and it doesn't matter whether we put anything back into the queue or not.
But I agree that it's best for respond to avoid throwing a synchronous exception (e.g. from trying to read the node) and will add a note to that effect.
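The distinction drawn here can be illustrated with a small sketch. It is not recoveringNodeConnection itself: retryOnIOError and demoRetry are hypothetical names, and the failing action merely simulates a lost connection with userError. The point is that try at type IOException only intercepts synchronous I/O failures raised by the action; an asynchronous exception such as ThreadKilled has a different type and passes straight through.

```haskell
import Control.Exception (IOException, try)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- Hypothetical helper: rerun an action while it fails with a
-- synchronous IOException, at most n more times.
retryOnIOError :: Int -> IO a -> IO (Either IOException a)
retryOnIOError n action = do
    r <- try action
    case r of
        Left e
            | n > 0 -> retryOnIOError (n - 1) action
            | otherwise -> pure (Left e)
        Right a -> pure (Right a)

-- A simulated connection that fails on the first two attempts.
demoRetry :: IO (Either IOException Int, Int)
demoRetry = do
    attempts <- newIORef (0 :: Int)
    let flaky = do
            k <- atomicModifyIORef' attempts (\i -> (i + 1, i + 1))
            if k < 3
                then ioError (userError "connection lost")
                else pure k
    r <- retryOnIOError 5 flaky
    n <- readIORef attempts
    pure (r, n)
```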
As far as I can tell, losing the connection to a node is a synchronous exception: The next read or write to the underlying socket will fail. As long as we do not read or write to the socket, there will be no exception related to it.
Ok, good point, thanks
The branch survives the test from ADP-871, which is great, but each time the node or wallet server is restarted, all the wallets begin to sync from scratch, that is:
I rewrote the commit history (though perhaps a bit too aggressively). 🤓
Would you mind rebasing over master? 🙏
* Group existing functions into logical sections
* Refactor `connectCardanoApiClient` to be more general and to compete directly with `connectClient`
* Add `mkLocalTxSubmissionClient` to follow the existing pattern for creating protocol clients.

1. Ensure follow catches asynchronous exceptions from connectClient, such that it can restart with a new connection and cursor.
2. Keep Local State Queries and to-be-submitted Txs queued until their requests finish, not just when they start. If a query is interrupted by the node being disconnected, it will block until a connection is re-established, and then retry.

New type `ChainFollower` provides callbacks that are used to drive and respond to the node-to-client messages.

* Percolate the `ChainPoint` type halfway to the database layer
* TODO later: Use the type provided by `Cardano.Api` instead

* Remove unused messages
* Remove superstitious printing of exceptions in `connectClient`.
* Remove `MsgFollowLog` constructor

* Check whether the block we want to roll back to is the genesis block, use that in `ChainPoint`.
* Request genesis explicitly when `MsgIntersectNotFound`. This ensures that the /read-pointer/ on the block producer side points to genesis, avoiding a weird corner case.
… so that the wallet does not sync from the origin anymore. ><
Sure, no problem. I have fixed the issue with the
LGTM! Thanks for picking this up!
I cannot officially approve since it was my PR, so feel free to do so yourself.
    -- Cave: An empty list is interpreted as requesting the genesis point.
    let points' = if null points
            then [Point Origin]
            else sortBy (flip compareSlot) points -- older points last
Ah wow, does this mean on master we'd always roll back to the oldest common checkpoint? 🤔
I see there's an Asc CheckpointSlot in the wallet DB listCheckpoints
https://github.com/input-output-hk/cardano-wallet/blob/c8cbdb8e40763f13bc58ecc8af9b39f34b0d0314/lib/core/src/Cardano/Wallet/DB/Sqlite.hs#L1416
Is there a case for dropping the sortBy and changing the direction of the sort the DB returns (both wallet db and pool db)?
Pros:
- Perhaps negligible performance increase
- avoids the cognitive overhead of sorting it in different directions in two different places
Con:
- easier for chainSync call-sites to mess up
Ah wow, does this mean on master we'd always roll back to the oldest common checkpoint? 🤔
Fortunately not, as that would have meant rolling back to genesis (this is the issue that Piotr encountered). There was a call to reverse somewhere in the old follow function on master. 😅
Actually, the specification of the ChainSync mini-protocol stipulates that the node should ignore the sort order and return the youngest point on the chain, but cardano-node behaves differently. See IntersectMBO/ouroboros-network#3443
Is there a case for dropping the sortBy and changing the direction of the sort the DB returns (both wallet db and pool db)?
In light of the ChainSync spec, I would make the case for keeping the sortBy in the chainSync function and instead dropping the sort direction from the DB layer.
In light of the ChainSync spec, I would make the case for keeping the sortBy in the chainSync function and instead dropping the sort direction from the DB layer.
(But do nothing for now, as this would become obsolete in the DB layer redesign anyway.)
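As a self-contained illustration of the ordering being discussed, the following sketch normalises a list of candidate intersection points so that the youngest comes first and an empty list becomes an explicit request for the origin. The ChainPoint type, slotOf and intersectionCandidates are simplified, hypothetical stand-ins rather than the definitions used in the PR.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- Hypothetical, simplified chain point: the origin or a slot number.
data ChainPoint = Origin | At Int
    deriving (Eq, Show)

-- The origin is older than any slot, because Nothing < Just x.
slotOf :: ChainPoint -> Maybe Int
slotOf Origin = Nothing
slotOf (At s) = Just s

compareSlot :: ChainPoint -> ChainPoint -> Ordering
compareSlot = comparing slotOf

-- Youngest points first, older points last; an empty list is replaced
-- by an explicit request for the origin.
intersectionCandidates :: [ChainPoint] -> [ChainPoint]
intersectionCandidates points
    | null points = [Origin]
    | otherwise = sortBy (flip compareSlot) points
```

With this in place, intersectionCandidates [At 5, Origin, At 20, At 10] yields [At 20, At 10, At 5, Origin], regardless of the order in which the DB layer returned the checkpoints.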
LGTM! Awesome improvement!
I played around with the branch a little, and it looks like it solves multiple stability issues, a.k.a. "vanishing wallets", addressed in:
I've also run this branch against the e2e tests a few times and observed no "wallet_not_responding" errors, which were rather common before. The tests now pass, e.g.:
https://github.com/input-output-hk/cardano-wallet/actions/runs/1415608203
https://github.com/input-output-hk/cardano-wallet/actions/runs/1412956182
https://github.com/input-output-hk/cardano-wallet/actions/runs/1415762485
https://github.com/input-output-hk/cardano-wallet/actions/runs/1412958475
❤️
bors r+
Build succeeded:
Issue Number
Based on #2745
ADP-871
Motivation
Simplify and clean up the chain following code, in the hope of making the wallet more robust against node disconnects. In particular, reduce the number of threads used to run the ChainSync protocol.
Overview
- chainSync function now takes a record of callbacks, ChainFollower, as an argument. These callbacks are used to request the current intersection, as well as react to roll forward and roll backwards messages.
- Cursor type.
Progress
- ChainFollowLog messages are traced or removed if redundant.
- withFollowStatsMonitoring runs in a separate thread so that it can compute statistics in regular time intervals.
- connectTo and are handled by recoveringNodeConnection
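The "record of callbacks" idea from the overview can be sketched as follows. This is a toy model under stated assumptions, not the ChainFollower defined in the PR: the field names readIntersection, rollForward and rollBackward, the Int-based Point and Block types, and the runFollower driver are all hypothetical.

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Hypothetical, simplified stand-ins: points and blocks are slot numbers.
type Point = Int
type Block = Int

-- A record of callbacks, one per kind of event the protocol can deliver.
data ChainFollower m = ChainFollower
    { readIntersection :: m [Point]
    , rollForward :: Block -> m ()
    , rollBackward :: Point -> m ()
    }

data Msg = RollForward Block | RollBackward Point

-- Toy driver: ask the follower for its known points, then feed it messages.
runFollower :: Monad m => ChainFollower m -> [Msg] -> m [Point]
runFollower follower msgs = do
    points <- readIntersection follower
    mapM_ step msgs
    pure points
  where
    step (RollForward b) = rollForward follower b
    step (RollBackward p) = rollBackward follower p

-- A follower that keeps the chain (newest block first) in an IORef.
demoFollower :: IO [Block]
demoFollower = do
    chain <- newIORef []
    let follower = ChainFollower
            { readIntersection = readIORef chain
            , rollForward = \b -> modifyIORef' chain (b :)
            , rollBackward = \p -> modifyIORef' chain (filter (<= p))
            }
    _ <- runFollower follower
        [RollForward 1, RollForward 2, RollForward 3, RollBackward 1, RollForward 4]
    readIORef chain
```

Because the driver only talks to the follower through the record, the same protocol code can serve different consumers, each supplying its own callbacks.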
Comments