Skip to content
This repository has been archived by the owner on Aug 18, 2020. It is now read-only.

[CBR-471] Fix Restoration Performance Regression #3786

Merged
merged 2 commits into from
Oct 25, 2018

Conversation

KtorZ
Copy link
Contributor

@KtorZ KtorZ commented Oct 24, 2018

Description

I am still testing this (I had an issue with the very last block application causing the checkpoints to diverge in a non expected way). I believe it's now fixed but I have to wait full restoration.
edit: Now tested, everything seems fine (see comment below)

We use to perform a db transaction per block which turns out to slow down the process quite a lot. Instead, we can apply k (.e.g 1000) updates inside one transaction and move on to the next batch. There's a bit of fiddling to do to get things working, but it turns out to be pretty efficient (~600 block/s on my setup, vs 80 block/s without the patch).

The main trick is located in the new applyHistoricalBlocks which applies update in sequence, stopping at the first error encountered.

Besides, we should have cought the performance regression earlier. We can't realistically perform restoration as part of our unit or main integration tests because of the time it takes. However, it is perfectly realistic to have it ran nightly, and monitor its output to be inside accepted boundaries.

Linked issue

[CBR-471]

Type of change

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • 🛠 New feature (non-breaking change which adds functionality)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🏭 Refactoring that does not change existing functionality but does improve things like code readability, structure etc
  • 🔨 New or improved tests for existing code
  • ⛑ git-flow chore (backport, hotfix, etc)

Developer checklist

  • I have read the style guide document, and my code follows the code style of this project.
  • If my code deals with exceptions, it follows the guidelines.
  • I have updated any documentation accordingly, if needed. Documentation changes can be reflected in opening a PR on cardanodocs.com, amending the inline Haddock comments, any relevant README file or one of the document listed in the docs directory.
  • CHANGELOG entry has been added and is linked to the correct PR on GitHub.

Testing checklist

  • I have added tests to cover my changes.
  • All new and existing tests passed.

QA Steps

nix-build -A acceptanceTests.mainnet.full -o acceptance-tests && ./acceptance-tests

... then go for a long coffee ☕

Screenshots (if available)

Launching wallet node: /nix/store/19ls1ww691cq0z32196alq82js761pj4-wallet-connect-to-mainnet
22:05:09 Blockchain syncing...
22:05:19 Blockchain syncing: 81%  1399744/1709971 blocks
22:05:32 Blockchain syncing: 82%  1403775/1709971 blocks
22:05:43 Blockchain syncing: 82%  1409791/1709972 blocks
...
22:15:55 Blockchain syncing: 98%  1691570/1710002 blocks
22:16:05 Blockchain syncing: 99%  1697522/1710003 blocks
22:16:15 Blockchain syncing: 99%  1703218/1710003 blocks
22:16:25 Blockchain syncing: 99%  1705969/1710004 blocks
Blockchain synced: 100%  1710004 blocks

Blockchain sync complete!
Blockchain height: 1710004 blocks
Elapsed time: 686 seconds

Going to restore a wallet
[[{"data":{"id":"Ae2tdPwUPEZ1UMga1bnFFD7FoFN9ZFoPLCWLaGcqNnfovcrcT3oMSuAz8DD","name":"Acceptance Wallet","balance":0,"hasSpendingPassword":false,"spendingPasswordLastUpdate":"2018-10-24T20:16:35.570007","createdAt":"2018-10-24T20:16:50.106258","assuranceLevel":"normal","syncState":{"tag":"restoring","data":{"throughput":{"quantity":0,"unit":"blocksPerSecond"},"percentage":{"quantity":0,"unit":"percent"},"estimatedCompletionTime":{"quantity":1710454000,"unit":"milliseconds"}}},"type":"regular"},"status":"success","meta":{"pagination":{"totalPages":1,"page":1,"perPage":1,"totalEntries":1}}}

22:16:50 Wallet restoring: 0% (0 blocks/s)
22:17:00 Wallet restoring: 0% (415 blocks/s)
22:17:10 Wallet restoring: 1% (622 blocks/s)
22:17:20 Wallet restoring: 1% (627.4 blocks/s)
...
22:56:46 Wallet restoring: 97% (715.975 blocks/s)
22:56:56 Wallet restoring: 98% (716.302 blocks/s)
22:57:06 Wallet restoring: 98% (716.7 blocks/s)
22:57:16 Wallet restoring: 99% (717.754 blocks/s)
22:57:26 Wallet restoring: 99% (718.192 blocks/s)
22:57:36 Wallet restoring: null% (715.272 blocks/s)
Restoration complete!
Elapsed time: 2484 seconds

Note: There was a typo in the script when I ran it causing it not to terminate properly once synced. I've verified the log file and no errors related to sync (some lookup network failures but nothing unusual).
Plus, checking the wallet manually at the end:

{
   "data" : [
      {
         "id" : "Ae2tdPwUPEZ1UMga1bnFFD7FoFN9ZFoPLCWLaGcqNnfovcrcT3oMSuAz8DD",
         "spendingPasswordLastUpdate" : "2018-10-24T20:16:50.106258",
         "hasSpendingPassword" : false,
         "type" : "regular",
         "syncState" : {
            "data" : null,
            "tag" : "synced"
         },
         "createdAt" : "2018-10-24T20:16:50.106258",
         "balance" : 0,
         "name" : "Acceptance Wallet",
         "assuranceLevel" : "normal"
      }
   ],
   "meta" : {
      "pagination" : {
         "totalEntries" : 1,
         "totalPages" : 1,
         "page" : 1,
         "perPage" : 10
      }
   },
   "status" : "success"
}

@KtorZ KtorZ requested review from erikd, edsko, dcoutts and karknu October 24, 2018 18:11
@KtorZ KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch 2 times, most recently from 1a89260 to 96965af Compare October 24, 2018 21:57
Copy link
Contributor

@dcoutts dcoutts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, though I've not fully checked everything. Some suggestions below.

@@ -379,54 +381,50 @@ restoreWalletHistoryAsync wallet rootId prefilter progress start (tgtHash, tgtSl
startingPoint <- case start of
Nothing -> withNode $ getFirstGenesisBlockHash genesisHash
Just sh -> nextHistoricalHash sh >>= maybe (throwM $ RestorationSuccessorNotFound sh) pure
restore genesisHash startingPoint NoTimingData
restore genesisHash startingPoint 1000
where
wId :: WalletId
wId = WalletIdHdRnd rootId

-- Process the restoration of the block with the given 'HeaderHash'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment would now be out of date right? If I followed correctly, it's restore starting from the block with the given hash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed 👍

lift (nextHistoricalHash currentHash) >>= \case
Nothing -> throwM (RestorationFinishUnreachable tgtHash currentHash)
Just hh' -> put (hh', False)
lift $ getPrefilteredBlockOrThrow genesisHash currentHash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of a state monad here seems a bit odd to me. This is just keeping the currentHash as the running value right? I think I'd just write this as a local recursive action with the currentHash as a local arg.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried both approach actually and found the StateMonad the most readable of the two. The "edge" case makes the recursive-function approach a bit ugly :/

@erikd
Copy link
Member

erikd commented Oct 25, 2018

tried running the Nix scipt myself and first off, got a few locale related errors:

$ nix-build -A acceptanceTests.mainnet.full -o acceptance-tests && ./acceptance-tests
warning: dumping very large path (> 256 MiB); this may run out of memory
/nix/store/73pfw2r6c9g9v5rnyzwg38fm2ap3j8fd-acceptance-tests-mainnet
/nix/store/0pc9pzaqdbygxr2nyng5h71mkwbwv2fc-bash-4.4-p12/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_AU.UTF-8)
./acceptance-tests: line 23: warning: setlocale: LC_ALL: cannot change locale (en_GB.UTF-8): No such file or directory
Launching wallet node: /nix/store/g5m0r33dxw9zjv88zypbbgb437jfixam-wallet-connect-to-mainnet

but it did seem to sync:

....
12:37:16 Blockchain syncing: 97%  1673651/1710959 blocks
12:37:29 Blockchain syncing: 98%  1684402/1710959 blocks
12:37:40 Blockchain syncing: 98%  1692786/1710959 blocks
12:37:50 Blockchain syncing: 99%  1704754/1710959 blocks
Blockchain synced: 100%  1710959 blocks

Blockchain sync complete!
Blockchain height: 1710959 blocks
Elapsed time: 1846 seconds

but then during wallet restore I got:

Going to restore a wallet
{"data":{"id":"Ae2tdPwUPEZ1UMga1bnFFD7FoFN9ZFoPLCWLaGcqNnfovcrcT3oMSuAz8DD","name":"Acceptance Wallet","balance":0,"hasSpendingPassword":false,"spendingPasswordLastUpdate":"2018-10-25T01:38:00.112482","createdAt":"2018-10-25T01:38:07.108056","assuranceLevel":"normal","syncState":{"tag":"restoring","data":{"throughput":{"quantity":0,"unit":"blocksPerSecond"},"percentage":{"quantity":0,"unit":"percent"},"estimatedCompletionTime":{"quantity":1711414000,"unit":"milliseconds"}}},"type":"regular"},"status":"success","meta":{"pagination":{"totalPages":1,"page":1,"perPage":1,"totalEntries":1}}}jq: error (at <stdin>:1): Cannot index string with string "data"
jq: error (at <stdin>:1): Cannot index string with string "data"
12:38:07 Wallet restoring...
jq: error (at <stdin>:1): Cannot index string with string "data"
jq: error (at <stdin>:1): Cannot index string with string "data"

@KtorZ KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch from 96965af to 257f610 Compare October 25, 2018 06:13
@KtorZ
Copy link
Contributor Author

KtorZ commented Oct 25, 2018

@erikd Yes indeed, my little patch yesterday just made it worse 😅 ... Now fixed properly.

@KtorZ KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch from 257f610 to 6fdffdd Compare October 25, 2018 06:29
wallet-new/src/Cardano/Wallet/Kernel/DB/AcidState.hs Outdated Show resolved Hide resolved

-- Flush to Acid-State & SQLite
k <- getSecurityParameter (wallet ^. walletNode)
let (blocks, txMetas, slotIds) = unzip3 (catMaybes updates)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rewrite this code slightly; after discussing this code with @KtorZ , the catMaybes here has a dual purpose: any EBBs will be Nothing, and once we reach the final block halfway the batch, all the elements after that final block will also be Nothing. I find the style a bit confusing, and would probably prefer an explicit recursion style with an accumulator. That said, I don't think it's wrong what's here; but I think it does at least need a comment.

@edsko edsko requested a review from adinapoli-iohk October 25, 2018 06:43
Copy link
Contributor

@adinapoli-iohk adinapoli-iohk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- I don't have anything major to add on top of what Edsko & Duncan already said. Nice work!

curl --silent --cacert ${stateDir}/tls/client/ca.crt --cert ${stateDir}/tls/client/client.pem -H "Content-Type: application/json" https://${wallet.walletListen}/api/v1/wallets \
-d '{
"operation": "restore",
"backupPhrase": ["session", "ring", "phone", "arrange", "notice", "gap", "media", "olympic", "water", "road", "spider", "rate"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully people won't try to use this one for production wallets 😉

(I always feel like we should have some mechanism for preventing mainnet to accept any of the mnemonics we use in tests etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I thought the same things 😶 ... I hope it's hidden enough in the nix entrails that no one will dig this out... right ? 😨

wallet-new/src/Cardano/Wallet/Kernel/DB/AcidState.hs Outdated Show resolved Hide resolved
@@ -379,54 +381,51 @@ restoreWalletHistoryAsync wallet rootId prefilter progress start (tgtHash, tgtSl
startingPoint <- case start of
Nothing -> withNode $ getFirstGenesisBlockHash genesisHash
Just sh -> nextHistoricalHash sh >>= maybe (throwM $ RestorationSuccessorNotFound sh) pure
restore genesisHash startingPoint NoTimingData
restore genesisHash startingPoint 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor one -- might be nice to explicitly have a where clause for this "magic number", to clarify its purpose.

@KtorZ
Copy link
Contributor Author

KtorZ commented Oct 25, 2018

    getBatch genesisHash hh batchSize = go [] batchSize hh
        where
            go !updates !n !currentHash | n <= 0 =
                return (updates, currentHash)

            go !updates _ !currentHash | currentHash == tgtHash =
                getPrefilteredBlockOrThrow genesisHash currentHash >>= \case
                    Nothing -> go updates 0 currentHash
                    Just u  -> go (updates ++ [u]) 0 currentHash

            go !updates !n !currentHash = do
                nextHash <- nextHistoricalHash currentHash >>= \case
                    Nothing  -> throwM (RestorationFinishUnreachable tgtHash currentHash)
                    Just hh' -> return hh'

                getPrefilteredBlockOrThrow genesisHash currentHash >>= \case
                    Nothing -> go updates (n - 1) nextHash
                    Just u  -> go (updates ++ [u]) (n - 1) nextHash

Currently trying out this. I found a nice way to put this. A bit afraid about the boundary case so I'll go through a full restoration once again and see :)

@KtorZ KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch from 6fdffdd to 9cb1a58 Compare October 25, 2018 08:09
@KtorZ
Copy link
Contributor Author

KtorZ commented Oct 25, 2018

Addressed comments above; went for the algorithm described above with accumulators and recursion instead of the StateMonad. I've re-ran the entire script, caught another error in the acceptance bash script (god I hate bash) and fixed it as well.

KtorZ added 2 commits October 25, 2018 10:49
We use to perform a db transaction per block which turns out
to slow down the process quite a lot.
Instead, we can apply k (.e.g 1000) updates inside one transaction
and move on to the next batch. There's a bit of fiddling to do to
get things working, but it turns out to be pretty efficient.

The main trick is located in the new `applyHistoricalBlocks` which
applies update in sequence, stopping at the first error encountered.
We should have cought the performance regression earlier. We can't realistically perform restoration
as part of our unit or main integration tests because of the time it takes. However, it is perfectly
realistic to have it ran nightly, and monitor its output to be inside accepted boundaries.
@KtorZ KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch from 9cb1a58 to 80c4b5b Compare October 25, 2018 08:49
@KtorZ KtorZ removed the request for review from karknu October 25, 2018 09:58
Copy link
Contributor

@dcoutts dcoutts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now.

Copy link
Contributor

@rvl rvl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Bash is bash.
The build timeouts are set in .buildkite/nightly.yml.

@KtorZ
Copy link
Contributor Author

KtorZ commented Oct 25, 2018

Integration tests still failing for an unrelated reason:

  integration/TransactionSpecs.hs:102:22: 
  1) Transactions asset-locked wallets can receive funds and transactions are confirmed in index
       Falsifiable (after 1 test):
       
       not expected: 0

Merging anyway 👍

@KtorZ KtorZ merged commit 150fb11 into develop Oct 25, 2018
@KtorZ KtorZ deleted the KtorZ/CBR-471/fix-restoration-perf-regression branch October 25, 2018 13:28
KtorZ added a commit that referenced this pull request Nov 9, 2018
…ration-perf-regression

[CBR-471] Fix Restoration Performance Regression
KtorZ added a commit to input-output-hk/cardano-wallet-legacy that referenced this pull request Nov 9, 2018
…hk/KtorZ/CBR-471/fix-restoration-perf-regression

[CBR-471] Fix Restoration Performance Regression
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants