Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor chain following #2750

Merged
merged 10 commits into from
Nov 4, 2021
Merged

Conversation

Anviking
Copy link
Member

@Anviking Anviking commented Jul 1, 2021

Issue Number

Based on #2745
ADP-871

Motivation

Simplify and clean up the chain following code, in the hope of making the wallet more robust against node disconnects. In particular, reduce the number of threads used to run the ChainSync protocol.

Overview

  • The chainSync function now takes a record of callbacks, ChainFollower, as an argument. These callbacks are used to request the current intersection, as well as react to roll forward and roll backwards messages.
  • Remove the old Cursor type.
  • Reduce the number of threads used for implementing the above.

Progress

  • Integration tests run without errors
  • Clean up / review more closely
    • Corner cases
    • Logging
      • Aggregation of sync progress works again
      • Ensure that all ChainFollowLog messages are traced or removed if redundant.
      • withFollowStatsMonitoring runs in a separate thread so that it can compute statistics in regular time intervals.
    • Node disconnect errors
      • … are thrown as exceptions in connectTo and are handled by recoveringNodeConnection
      • Check again at the end that the PR solves the bug
    • Can we property-test this? (Probably no benefit.)
    • Check for performance regressions (running the chain-sync and db operations on the same thread might make it slightly slower)
      • Actually seemed 10% faster running integration tests locally with -j 8.
        • ef3fdc3 -> 274s (n=2)
        • master -> 303s (n=1) <- but I only ran once
        • Speedup could be related to chain-following now being able to rollback without necessarily re-negotiating the intersection.

Comments

@Anviking Anviking self-assigned this Jul 1, 2021
@Anviking Anviking changed the title Very WIP: Refactor chain following Rough draft: Refactor chain following Jul 1, 2021
Copy link
Contributor

@rvl rvl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be good to simplify chain following.

@@ -329,6 +331,8 @@ newtype ErrNoSuchWallet
= ErrNoSuchWallet WalletId -- Wallet is gone or doesn't exist yet
deriving (Eq, Show)

instance Exception ErrNoSuchWallet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The possibility of both checked and unchecked exceptions - this is something we want to avoid.

-- TODO: Recover on connection lost exceptions!
connectClient tr handlers client versionData conn

, currentNodeTip =
fromTip getGenesisBlockHash <$> atomically readNodeTip
, currentNodeEra =
-- NOTE: Is not guaranteed to be consistent with @currentNodeTip@
readCurrentNodeEra
, watchNodeTip = do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the reformulation of chainSync, this function will probably be redundant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how watchNodeTip would be related to this PR.

We use it for

  1. Caching reward balances in DB
  2. Tx resubmission

It would be good to unify the DB and in-memory caching of rewards, but that's a separate concern. (I think this might contribute to values being slower to be updated, and perhaps to some integration test flakiness).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you mean we could implement watchNodeTip using chainSync?

But I think

  1. having a single tip-sync protocol and multiple STM observers (allow callbacks to be skipped under load)
  2. not having to deserialise block bodies

are good qualities with the current approach, plus I don't see how ChainFollower could allow fast-forwarding the intersection to the tip.

@Anviking Anviking force-pushed the anviking/ADP-871/refactor-chain-following branch from c83e55a to 59fe9d9 Compare July 6, 2021 19:23
@Anviking Anviking changed the base branch from anviking/ADP-871/node-connections to master July 6, 2021 19:30
@Anviking Anviking force-pushed the anviking/ADP-871/refactor-chain-following branch from 5e8501d to efc0229 Compare July 7, 2021 08:32
@Anviking Anviking mentioned this pull request Jul 7, 2021
4 tasks
@Anviking
Copy link
Member Author

Closing for now.

@Anviking Anviking closed this Aug 10, 2021
@rvl
Copy link
Contributor

rvl commented Aug 14, 2021

Let's keep the branch and re-open this PR later, because this refactor is something that we definitely need.

@Anviking Anviking reopened this Sep 1, 2021
@Anviking Anviking force-pushed the anviking/ADP-871/refactor-chain-following branch from d774497 to b1d146f Compare September 2, 2021 12:25
@HeinrichApfelmus HeinrichApfelmus force-pushed the anviking/ADP-871/refactor-chain-following branch 2 times, most recently from d265671 to dd9ab0a Compare October 18, 2021 13:52
@HeinrichApfelmus HeinrichApfelmus force-pushed the anviking/ADP-871/refactor-chain-following branch from e792da7 to 405cfb6 Compare October 20, 2021 14:36
@HeinrichApfelmus HeinrichApfelmus changed the title Rough draft: Refactor chain following Refactor chain following Oct 22, 2021
@HeinrichApfelmus HeinrichApfelmus force-pushed the anviking/ADP-871/refactor-chain-following branch from 4f86800 to a74655c Compare October 22, 2021 19:39
@HeinrichApfelmus
Copy link
Contributor

bors try

iohk-bors bot added a commit that referenced this pull request Oct 26, 2021
@iohk-bors
Copy link
Contributor

iohk-bors bot commented Oct 26, 2021

try

Build succeeded:

@HeinrichApfelmus
Copy link
Contributor

The integration test failures seem to be unrelated to the new chain following code. On my local machine, these failures do pass.

Instead, the failures seem to be related to:

  • File deletion on windows (when cleaning up old wallets)
  • Problems with DB concurrency (SQLITE_BUSY)

It appears that these problems will have to be addressed in the database layer.

@HeinrichApfelmus HeinrichApfelmus force-pushed the anviking/ADP-871/refactor-chain-following branch from a48f16a to 63781c0 Compare October 26, 2021 15:38
@HeinrichApfelmus HeinrichApfelmus marked this pull request as ready for review October 26, 2021 15:38
Copy link
Member Author

@Anviking Anviking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an incomplete first pass now as well as some commits I saw in the past, lgtm!, left some comments, but will have another look.

Maybe we could regardless have a ~20 minute call about it tomorrow some time after the futurespective?

lib/core/src/Cardano/Wallet.hs Show resolved Hide resolved
lib/core/src/Cardano/Wallet.hs Show resolved Hide resolved
lib/core/src/Cardano/Wallet/DB/Sqlite.hs Show resolved Hide resolved
@@ -505,7 +520,7 @@ localStateQuery queue =
:: LocalStateQueryCmd block m
-> m (LSQ.ClientStAcquired block (Point block) (Query block) m Void)
clientStAcquired (SomeLSQ cmd respond) = pure $ go cmd $ \res -> do
LSQ.SendMsgRelease (respond res >> clientStIdle)
LSQ.SendMsgRelease (respond res >> finalizeCmd >> clientStIdle)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if not technically a lost connection between respond res and finaliseCmd can lead to the callback being called twice.

But this should be exceedingly rare, and it shouldn't affect queries using our send helper:

send
    :: MonadSTM m
    => TQueue m (cmd m)
    -> ((a -> m ()) -> cmd m)
    -> m a
send queue cmd = do
    tvar <- newEmptyTMVarIO
    atomically $ writeTQueue queue (cmd (atomically . putTMVar tvar))
    atomically $ takeTMVar tvar

and I don't imagine it would affect anything else either.

If you concur, maybe we should add a note about it though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I do not believe that there is a race condition here. 🤔 As far as I can tell, losing the connection to a node is a synchronous exception: The next read or write to the underlying socket will fail. As long as we do not read or write to the socket, there will be no exception related to it.

If this code path receives an asynchronous exception, then this will not be caught by recoveringNodeConnection, and it doesn't matter whether we put anything back into the queue or not.

But I agree that it's best for respond to avoid throwing a synchronous exception (e.g. from trying to read the node) and will add a note to that effect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, losing the connection to a node is a synchronous exception: The next read or write to the underlying socket will fail. As long as we do not read or write to the socket, there will be no exception related to it.

Ok, good point, thanks

@piotr-iohk
Copy link
Contributor

piotr-iohk commented Oct 27, 2021

The branch survives the test from ADP-871 which is great, but each time, after node or wallet server is restarted, all the wallets begin to sync from scratch, that is:

  • after restarting the node (with wallet server working in bg)
  • or after restarting the wallet (with node working in bg)
  • also after restarting both... :)

@HeinrichApfelmus HeinrichApfelmus force-pushed the anviking/ADP-871/refactor-chain-following branch from 2172c52 to 30f9d1c Compare October 28, 2021 15:47
@HeinrichApfelmus HeinrichApfelmus force-pushed the anviking/ADP-871/refactor-chain-following branch 2 times, most recently from 284c66d to 8f0d750 Compare October 29, 2021 15:55
@HeinrichApfelmus
Copy link
Contributor

I rewrote the commit history (though perhaps a bit too aggressively). 🤓

@piotr-iohk
Copy link
Contributor

Would you mind rebasing over master 🙏

HeinrichApfelmus and others added 10 commits November 1, 2021 12:39
* Group existing functions into logical sections
* Refactor `connectCardanoApiClient` to be more general and to compete directly with `connectClient`
* Add `mkLocalTxSubmissionClient` to follow the existing pattern for creating protocol clients.
1. Ensure follow catches asyncronous exceptions from connectClient,
such that it can restart with a new connection and cursor.

2. Keep Local State Queries and to-be-submitted Txs queued until their
requests finish, not just when they start. If a query is interrupted by
the node being disconnected, it will block until a connection is
re-established, and then retry.
New type `ChainFollower` provides callbacks that are used to drive and respond to the node-to-client messages.
* Percolate the `ChainPoint` type halfway to the database layer
* TODO later: Use the type provided by `Cardano.Api` instead
* Remove unused messages
* Remove superstitious printing of exceptions in `connectClient`.
* Remove `MsgFollowLog` constructor
* Check whether the block we want to rollback to is the genesis block, use that in `ChainPoint`.
* Request genesis explicitly when `MsgIntersectNotFound`. This ensures that the /read-pointer/ on the block producer side points to genesis, avoiding a weird corner case.
… so that the wallet does not sync from the origin anymore. ><
@HeinrichApfelmus HeinrichApfelmus force-pushed the anviking/ADP-871/refactor-chain-following branch from 8f0d750 to 99cfe7c Compare November 1, 2021 11:40
@HeinrichApfelmus
Copy link
Contributor

Would you mind rebasing over master 🙏

Sure, no problem. I have fixed the issue with the HasLogger class.

piotr-iohk pushed a commit that referenced this pull request Nov 2, 2021
Copy link
Member Author

@Anviking Anviking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for picking this up!

I cannot officially approve since it is was my PR, so feel free to do so yourself.

-- Cave: An empty list is interpreted as requesting the genesis point.
let points' = if null points
then [Point Origin]
else sortBy (flip compareSlot) points -- older points last
Copy link
Member Author

@Anviking Anviking Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah wow — does this mean on master we'd always rollback to the oldest common checkpoint? 🤔

I see there's a Asc CheckpointSlot in the wallet DB listCheckpoints
https://github.com/input-output-hk/cardano-wallet/blob/c8cbdb8e40763f13bc58ecc8af9b39f34b0d0314/lib/core/src/Cardano/Wallet/DB/Sqlite.hs#L1416

Is there a case for dropping the sortBy and changing the direction of the sort the DB returns (both wallet db and pool db)?

Pros:

  • Perhaps negligible performance increase
  • avoids the cognitive overhead of sorting it in different directions in two different places

Con:

  • easier for chainSync call-sites to mess up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah wow — does this mean on master we'd always rollback to the oldest common checkpoint? 🤔

Fortunately not, as that would have meant rolling back to genesis (this is the issue that Piotr encountered). There was a call to reverse somewhere in the old follow function on master. 😅

Actually, the specification of the ChainSync mini-protocol stipulates that the node should ignore the sort order and returns the youngest point on the chain — but cardano-node behaves differently. See IntersectMBO/ouroboros-network#3443

Is there a case for dropping the sortBy and changing the direction of the sort the DB returns (both wallet db and pool db)?

In light of the ChainSync spec, I would make the case for keeping the sortBy in the chainSync function and instead dropping the sort direction from the DB layer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of the ChainSync spec, I would make the case for keeping the sortBy in the chainSync function and instead dropping the sort direction from the DB layer.

(But do nothing for now, as this would become obsolete in the DB layer redesign anyway.)

lib/core/src/Cardano/Wallet.hs Show resolved Hide resolved
Copy link
Contributor

@piotr-iohk piotr-iohk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Awesome improvement!

Playing around with the branch a little and It looks that it solves multiple stability issues a.k.a. "vanishing wallets", addressed in:

I've also run this branch against e2e tests few times and no "wallet_not_responding" observed, which was rather common before, now passing e.g.:

https://github.com/input-output-hk/cardano-wallet/actions/runs/1415608203
https://github.com/input-output-hk/cardano-wallet/actions/runs/1412956182
https://github.com/input-output-hk/cardano-wallet/actions/runs/1415762485
https://github.com/input-output-hk/cardano-wallet/actions/runs/1412958475

❤️

@piotr-iohk
Copy link
Contributor

bors r+

@iohk-bors
Copy link
Contributor

iohk-bors bot commented Nov 4, 2021

Build succeeded:

@iohk-bors iohk-bors bot merged commit 75c0f19 into master Nov 4, 2021
@iohk-bors iohk-bors bot deleted the anviking/ADP-871/refactor-chain-following branch November 4, 2021 10:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants