ChainSync Pipelined: moar speed. #1615
Conversation
Force-pushed from c0a072f to 76c9f9a
```haskell
pipeline (Succ n) respond | natToInt (Succ n) == fromIntegral maxInFlight =
    P.CollectResponse Nothing $ collectResponses respond [] n
pipeline n respond =
    P.SendMsgRequestNextPipelined $ pipeline (Succ n) respond
```
I would use `CollectResponse (Just (SendMsgRequestNextPipelined ..)) ...`. This pokes the receiving queue, and if we already got something it will be available in the next step, so more promptly. You could experiment with some strategies for when to use `CollectResponse (Just ...) ..` and when to use `SendMsgRequestNextPipelined`. But maybe you're not doing so because of Jörmungandr support?
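For illustration, here is a rough sketch of that interleaved strategy, reusing the names from the snippet above (`pipeline`, `respond`, `collectResponses`, `natToInt`, `maxInFlight`; `Zero` is assumed as the zero case of the same counter). The exact shape of `CollectResponse`'s first argument differs between ouroboros-network versions, so treat this as pseudocode rather than a drop-in change:

```haskell
-- Sketch only: same local names as the snippet above; not compiled against
-- the real ouroboros-network types.
pipeline Zero respond =
    -- nothing in flight yet, so there is nothing to collect: just pipeline
    P.SendMsgRequestNextPipelined $ pipeline (Succ Zero) respond
pipeline (Succ n) respond
    | natToInt (Succ n) == fromIntegral maxInFlight =
        -- window saturated: stop sending and drain what is in flight
        P.CollectResponse Nothing $ collectResponses respond [] n
    | otherwise =
        -- keep pipelining, but collect a response right away if one has
        -- already arrived instead of waiting for the window to fill up
        P.CollectResponse
            (Just $ P.SendMsgRequestNextPipelined $ pipeline (Succ (Succ n)) respond)
            (collectResponses respond [] n)
```

Here the collect handler still drains the remaining responses as in the original; a fully interleaved variant would resume pipelining from inside the handler instead.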
> But maybe you're not doing so because of Jörmungandr support?
Not really. We process blocks in batches through the wallet, to avoid having to hit the database and create a checkpoint after processing every block. The restoration loop drives the networking side by asking for the next blocks; it is in essence similar to how ChainSync works, except that instead of requesting a single block, it requests a batch.

So here, we simply leverage pipelining to collect the next batch more rapidly: it queues `maxInFlight` requests and then collects them all. Before, we would simply call `RequestNext` several times, waiting for each request to complete before issuing the next one.
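As a back-of-the-envelope illustration of why the batched approach pays off, here is a toy latency model. All names and numbers below are illustrative assumptions, not measurements from this PR:

```haskell
-- Toy model: compare requesting blocks strictly one-by-one with queueing a
-- window of pipelined requests and then collecting them all.
module Main where

sequentialMs :: Double -> Double -> Int -> Double
sequentialMs rtt perBlock nBlocks =
    fromIntegral nBlocks * (rtt + perBlock)          -- one round trip per block

batchedMs :: Double -> Double -> Int -> Int -> Double
batchedMs rtt perBlock window nBlocks =
    windows * (rtt + fromIntegral window * perBlock) -- roughly one round trip per window
  where
    windows = fromIntegral nBlocks / fromIntegral window

main :: IO ()
main = do
    -- e.g. 50ms round trip, 0.05ms per block, windows of 1000, 4M blocks
    let rtt = 50; perBlock = 0.05; window = 1000; nBlocks = 4000000
    putStrLn $ "sequential: " <> show (sequentialMs rtt perBlock nBlocks / 60000) <> " min"
    putStrLn $ "batched:    " <> show (batchedMs rtt perBlock window nBlocks / 60000) <> " min"
```

With these made-up figures, the one-by-one strategy is dominated by round trips, while the batched strategy pays roughly one round trip per window.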
I see; then you could pipeline even more: collect and buffer things as they come from the network (on your side) and send batches for processing. If you do your own buffering you will be able to saturate the network.
Force-pushed from c1ef56b to 0c1a920
bors try

try: Build failed
Yes, I thought of that, but it requires a quite fundamental change in the way our networking layer currently operates. In theory, the network layer could indeed do its own buffering and, while the wallet engine is busy processing blocks and creating checkpoints, already be pipelining the next queries! Yet, the networking layer is currently driven by the engine through a pull-based mechanism, so it only requests the next blocks when asked, and this is a synchronous chain of events. I'll try to make this a bit more concurrent to see if the complexity is worth adding. Getting below 5 minutes restoration time would be pretty nice :D
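A minimal sketch of that more concurrent arrangement, assuming hypothetical `fetchNextBatch` and `processBatch` stand-ins (this is not the actual cardano-wallet networking layer): a bounded queue lets the network side keep pipelining ahead while the engine is busy creating checkpoints.

```haskell
import Control.Concurrent.Async (concurrently_)
import Control.Concurrent.STM
    ( atomically, newTBQueueIO, readTBQueue, writeTBQueue )
import Control.Monad (forever)

-- | Sketch: run the network side and the wallet engine concurrently,
-- connected by a bounded queue of batches. 'fetchNextBatch' and
-- 'processBatch' are hypothetical stand-ins for the real components.
bufferedSync
    :: IO [block]          -- ^ fetch the next batch from the node
    -> ([block] -> IO ())  -- ^ apply a batch and create a checkpoint
    -> IO ()
bufferedSync fetchNextBatch processBatch = do
    queue <- newTBQueueIO 4  -- at most 4 batches buffered ahead
    concurrently_
        -- producer: keeps the network saturated, blocks when the buffer is full
        (forever $ fetchNextBatch >>= atomically . writeTBQueue queue)
        -- consumer: the engine pulls batches at its own pace
        (forever $ atomically (readTBQueue queue) >>= processBatch)
```

The bound on the queue keeps memory use predictable: the producer simply blocks once it is a few batches ahead. A real version would also need to handle rollbacks and termination at the tip.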
lgtm, and tried restoring with Daedalus Mainnet
Force-pushed from 5f26e68 to 5172dcd
bors r+
1615: ChainSync Pipelined: moar speed. r=KtorZ a=KtorZ
Build failed

Interesting.
Force-pushed from 5172dcd to feb1ecf
bors r+
1615: ChainSync Pipelined: moar speed. r=KtorZ a=KtorZ

1635: Prevent bors from merging into any branch except master r=KtorZ a=rvl (another try at #1632)

1639: Post v2020-05-06 updates r=piotr-iohk a=piotr-iohk (update README compat table; update revisions for migration tests)
Hydra timing out again.

Build failed (retrying...)
This PR was included in a batch with a merge conflict, it will be automatically retried

22 similar comments
Issue Number
ADP-270
Overview
441eaf1
📍 replace the chain sync client with its pipelined version
No pipelining is done yet; this still uses the standard sequential way of sending requests, but now through the 'ChainSyncClientPipelined', which has extra constructors we can now use for pipelining.
178b0db
📍 implementing pipelining, pipeline by batches of 1000 requests. It's fast.
This is still work in progress, as the 'edge' cases aren't properly handled: rollbacks are just ignored, and the area near the tip shouldn't be pipelined but is. Yet, this is a starter that gives some idea. I haven't tweaked it much, nor tried different settings, but so far I've restored a wallet in ~6:30 minutes on Mainnet, so more than 10k blocks per second.
37cdeb5
📍 only pipeline when it's safe
This also makes sure to properly handle the starting and terminating conditions. This is a bit tricky because, ideally, to know whether we can safely pipeline, we want to check whether any of the pipelined requests would be made within `k` blocks from the tip. The tricky part is "blocks": during rollbacks, we have access to points on the chain which only reference slots, yet we need to compare block heights!

To solve this, I've adopted two strategies, pipelining and one-by-one, defaulting to the latter and only using the former when we have enough information to make a sound decision. This means that the client always starts by restoring a single block, from which it can figure out a block height and compare it with the node's tip. In case it's far enough, we go full speed, in steps, checking at each step whether it is still safe to proceed with the next one.

The last `k` blocks are therefore fetched one-by-one, which is quite slow, but it's `k` blocks out of millions. We _could_ possibly make this last part even faster by using the same batching strategy as before (where we would basically check after each request whether the node has rolled back or not), but it comes with some added complexity I am not necessarily fond of introducing. This new approach already cuts the overall restoration time by 2.5!
e.g. on Mainnet:

```
-- pipelined -----------------
- [2020-05-06 12:12:34.40 UTC] 87a2d760: syncProgress: still restoring (0.00%)
- [2020-05-06 12:15:12.80 UTC] 87a2d760: syncProgress: still restoring (50.01%)
-- one by one ----------------
- [2020-05-06 12:18:20.34 UTC] 87a2d760: syncProgress: still restoring (99.94%)
- [2020-05-06 12:18:37.38 UTC] 87a2d760: syncProgress: restored
------------------------------
6 minutes 3s / ~11366 block/s
```
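A simplified sketch of that decision (the names `SecurityParameter`, `Strategy`, and `chooseStrategy` are illustrative, not the actual cardano-wallet types): pipeline a window only when even its last request would still land more than `k` blocks below the node's tip, and otherwise fall back to one-by-one.

```haskell
import Numeric.Natural (Natural)

-- Illustrative types; the real code works with the wallet's own
-- block header and quantity types.
newtype SecurityParameter = SecurityParameter Natural  -- the protocol's k

data Strategy
    = PipelineBatch Natural  -- ^ safe to pipeline this many requests
    | OneByOne               -- ^ too close to the tip, or height unknown

-- | Decide how to fetch the next window of blocks, given our local block
-- height and the node tip's block height. We only pipeline when even the
-- last request of the window would still be more than k blocks from the tip.
chooseStrategy
    :: SecurityParameter
    -> Natural          -- ^ desired window size (e.g. 1000)
    -> Maybe Natural    -- ^ our current block height, if known
    -> Natural          -- ^ node tip block height
    -> Strategy
chooseStrategy (SecurityParameter k) window localHeight tipHeight =
    case localHeight of
        Nothing -> OneByOne  -- no block height yet: restore a single block first
        Just h
            | h + window + k < tipHeight -> PipelineBatch window
            | otherwise                  -> OneByOne
```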
485a324
📍 Regenerate nix
62d4707
📍 adjust log outputs for MsgNodeTip and MsgTxParameters
Going from:

```
[cardano-wallet.network:Debug:16] [2020-05-07 09:44:29.79 UTC] Network node tip block height is 4129973 at hash 3cf1a60a
[cardano-wallet.network:Info:17] [2020-05-07 09:31:03.66 UTC] TxParams for tip are: Fee policy: 155381.0 + 43.0x + 0.0y Tx max size: 8192
```

to:

```
[cardano-wallet.network:Debug:17] [2020-05-07 09:40:30.88 UTC] Network node tip is fe158920-[191.6466#4129961]
[cardano-wallet.network:Info:17] [2020-05-07 09:40:30.88 UTC] TxParams for tip are: [Fee policy: 155381.0 + 43.0x + 0.0y, Tx max size: 8192]
```
a) avoid multi-line log output with awkward structure for tx params
b) Show tip in a consistent way with other loggers (for instance, in the wallet engine).
feb1ecf
📍 fix asynchronous exception handling
There are two sides to this commit:
a) It fixes the asynchronous exception "catcher" in 'follow' that would fail to catch any exception occurring while the thread is sleeping. I noticed that sometimes the logs wouldn't display a proper error.

b) It also fixes the 'runQuery' combinator for our SQLite implementation to make sure that asynchronous exceptions are masked while executing SQL queries. I falsely thought this was already properly handled by 'runSqlConn' from persistent, but it seems to be the source of most of our 'ErrorBusy' / 'statementAlreadyFinalized' issues. :crossed_fingers:
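A minimal sketch of the idea behind (b), assuming a hypothetical connection lock and `runRawQuery` action (this is not the actual 'runQuery' from cardano-wallet): serialise access to the connection and keep asynchronous exceptions masked while the statement runs, so a cancelled worker cannot leave SQLite with a half-finalised statement.

```haskell
import Control.Concurrent.MVar (MVar, withMVar)
import Control.Exception (mask_)

-- | Sketch: one query at a time per connection, with asynchronous
-- exceptions masked for the duration of the statement. 'conn' and
-- 'runRawQuery' are hypothetical stand-ins for the real SQLite layer.
runQuery :: MVar conn -> (conn -> IO a) -> IO a
runQuery lock runRawQuery =
    withMVar lock $ \conn ->
        -- withMVar masks around taking and releasing the lock, but runs the
        -- action itself with exceptions restored; mask_ keeps the query body
        -- protected from asynchronous interruption as well.
        mask_ (runRawQuery conn)
```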
Comments
It's fast 🤷♂️