remove storage of intermediate checkpoints in restore blocks #2143
Conversation
Lgtm
This was a perhaps too-early optimization trying to reduce the time of rolling back. The `sparseCheckpoint` function returns an empty list for pretty much the entire restoration, except when reaching the last k blocks, where it returns a list of checkpoints to save: sparse for older blocks and dense near the tip. With the current parameters, blocks are kept if:

- their block height is not older than (tip - k), or it is 0;
- and their block height is a multiple of 100, or they are within 10 blocks of the last known block.

We currently do this calculation and filtering in two places:

1. In `restoreBlocks`, to pre-filter checkpoints to store in the database
2. In `prune` from the wallet DBLayer, to garbage-collect old checkpoints

Yet, what (1) buys us is very little gain on standard wallets, and a huge performance cost on large wallets. So let's analyze the two cases:

A/ Small Wallets

- The time to create a checkpoint is very small compared to the slot length.
- Restoring blocks is fast (up to 10K blocks per second on empty wallets).

Therefore, rolling back 2, 3, or 100 blocks makes pretty much no difference. Being able to store more checkpoints near the tip adds very little performance benefit, especially for the first restoration.

B/ Large Wallets

- The time to create a checkpoint is significant compared to the slot length (we've seen up to 4s).
- Restoring blocks is still quite fast (the time needed to process blocks remains quite small compared to the time needed to read and create new checkpoints).

The main problem with large wallets occurs when the wallet is almost synced and reaches the last 10 blocks of the chain. By trying to store intermediate checkpoints, not only does the wallet spend 10× more time in `restoreBlocks` than normally, it also keeps the database lock for all that duration. Consider the case where the wallet takes 4s to read and 4s to create a checkpoint, plus an additional 2s to prune them (these are actual data from large exchanges): by default, 10s is spent creating one checkpoint, and at least 40 more to create the intermediate ones. During this time, between 1 and 3 new checkpoints have been created, so the wallet already needs to prune out the last one it just spent 12s creating, and already needs to create new checkpoints right away. As a consequence, a lot of other functionalities are made needlessly slower than they could be, because for the whole duration of the `restoreBlocks` function, the wallet is holding the database lock.

Now, what happens if we avoid storing the "intermediate" checkpoints in restore blocks: blocks near the tip will eventually get stored, but one by one. So, when we _just_ reach the tip for the first time, we'll only store the last checkpoint. But then, the next 9 checkpoints created won't be pruned out. So, the worst that can happen is that the wallet is asked to roll back right after we've reached the tip and haven't created many checkpoints yet. Still, we would have at least two checkpoints in the past that are at most 2K blocks from the tip (because we fetch blocks in batches of 1000). So it's important that the batch size remains smaller than `k`, so that we can be sure there's always one checkpoint in the database.
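To make the keeping rule above concrete, here is a minimal sketch of it as a predicate over block heights. It is illustrative only, based purely on the description; the function name, parameter order, and types are assumptions, not the actual `sparseCheckpoint` code:

```haskell
import Data.Word (Word32)

-- Illustrative sketch of the keeping rule described above: a block height is
-- kept if it is not older than (tip - k) or is 0, and it is a multiple of 100
-- or within 10 blocks of the tip. Not the real implementation.
keepCheckpoint
    :: Word32  -- ^ k, the security parameter
    -> Word32  -- ^ current tip height
    -> Word32  -- ^ candidate checkpoint height
    -> Bool
keepCheckpoint k tip height =
    (height == 0 || height + k >= tip)
        && (height `mod` 100 == 0 || height + 10 >= tip)
```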
bors r+
2143: remove storage of intermediate checkpoints in restore blocks r=KtorZ a=KtorZ

# Issue Number

#2035

Co-authored-by: KtorZ <[email protected]>
Build failed:
lib/core/src/Cardano/Wallet.hs
Outdated
-- Rollback may still occur during this short period, but
-- rolling back from a few hundred blocks is relatively fast
-- anyway.
cfg = defaultSparseCheckpointsConfig { edgeSize = 0 }
👌
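As a side note for readers of this thread, the QuickCheck output quoted further down shows a `SparseCheckpointsConfig` record with `edgeSize` and `epochStability` fields. A minimal sketch of what such a record and its default could look like, with the concrete types and default values being assumptions rather than taken from the codebase:

```haskell
import Data.Word (Word32, Word8)

-- Sketch only: the field names come from the QuickCheck output quoted later in
-- this conversation; the types and default values below are assumptions.
data SparseCheckpointsConfig = SparseCheckpointsConfig
    { edgeSize       :: Word8   -- dense window of checkpoints kept near the tip
    , epochStability :: Word32  -- the security parameter k
    } deriving (Show)

-- Hypothetical default, shown only to give context for the
-- `defaultSparseCheckpointsConfig { edgeSize = 0 }` override quoted above.
defaultSparseCheckpointsConfig :: SparseCheckpointsConfig
defaultSparseCheckpointsConfig = SparseCheckpointsConfig
    { edgeSize       = 10
    , epochStability = 2160
    }
```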
Now that you have `SparseCheckpointsConfig` and properties, I think it would be nice to have the preconditions for it to work properly clear and explicit, either through types or through doc comments in the implementation, but not just by restricting generators.
(Mainly cf https://github.com/input-output-hk/cardano-wallet/pull/2143/files#r490008971)
Also moved the epoch stability to the 'SparseCheckpointsConfiguration' since it's mostly static
@Anviking ☝️ let me know
The previous version could possibly lead to an unaware developer tweaking parameters in a way that would generate an invalid configuration. This new version makes that more difficult / safer.
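One way to picture "more difficult / safer" is a smart-constructor style API that validates parameters instead of exposing raw record updates. The sketch below is purely hypothetical (building on the record sketched earlier in this thread); `mkSparseCheckpointsConfig` is not part of this PR:

```haskell
-- Hypothetical smart constructor enforcing basic preconditions up front,
-- rather than relying on test generators to avoid invalid configurations.
mkSparseCheckpointsConfig
    :: Word32  -- ^ epoch stability (k)
    -> Word8   -- ^ edge size (dense window near the tip)
    -> Maybe SparseCheckpointsConfig
mkSparseCheckpointsConfig k edge
    | k == 0                = Nothing  -- a gap derived from k would be degenerate
    | fromIntegral edge > k = Nothing  -- the dense window cannot exceed k
    | otherwise             = Just SparseCheckpointsConfig
        { edgeSize       = edge
        , epochStability = k
        }
```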
this is pretty intricate 😅, but 👌
--
-- So, `k / 3` = 720, which should remain around a second of time needed to catch
-- up in case of large rollbacks.
gapSize :: SparseCheckpointsConfig -> Word32
👍
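The `k / 3` = 720 figure in the quoted comment suggests the gap is derived from the epoch stability; as a rough illustration only (building on the config sketch above, not necessarily the exact implementation):

```haskell
-- Illustrative: with epochStability (k) = 2160 this yields 720,
-- matching the `k / 3` = 720 figure in the comment above.
gapSize :: SparseCheckpointsConfig -> Word32
gapSize cfg = epochStability cfg `div` 3
```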
prop_checkpointsEventuallyEqual
    :: GenSparseCheckpointsArgs
    -> Property
prop_checkpointsEventuallyEqual args@(GenSparseCheckpointsArgs cfg h) = prop
So as per DM I tried digging into this property a bit. My original question was whether the scenario you came up with is actually covered by tests:
Imagine the following scenario:
- We have a checkpoint at [0, 1000]
- The network layer starts fetching the next blocks, fetches the first 500, then rolls back internally 142 blocks, and fetches the next 500. So in the end it yields 858 blocks.
- We store the last one, so we now have [0, 1000, 1858]; so far so good.
- We continue, and it fetches the next batch, so we are now at [0,1000,1858,2858]; pruning occurs and we end up with [0,1000,2858].
- The next batch is fetched, [0,1000,2858,3858]; pruning occurs and we are left with [0,3858].

And here is the problem. Now, if we roll back, we can only roll back to genesis. Because of this interruption, an offset was created that makes the pruning function discard intermediate checkpoints that would be worth keeping.
Or, in some sense, that even if we roll back `k` blocks, there should be a checkpoint close by, such that we don't have to roll back to genesis.
If I've understood things correctly, this does hold after the wallet has synced 5 (edgeSize) blocks one-by-one at the tip, but not before. Ok — this seems fine.
(On `anviking/2035/experiments`: this holds at the end of `prop_checkpointsEventuallyEqual`, and otherwise fails with
quickCheck prop_checkpointsEventuallyEqual
*** Failed! Falsifiable (after 1 test):
GenSparseCheckpointsArgs (SparseCheckpointsConfig {edgeSize = 1, epochStability = 10}) 113
h=113
Batches [[0,1,2,3,4,5,6,7],[8],[9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24],[25,26,27,28,29],[30],[31,32,33,34,35,36,37],[38,39,40,41,42,43,44,45,46],[47,48],[49,50,51,52,53,54,55,56,57],[58,59,60,61,62,63],[64,65,66,67,68,69,70,71,72],[73,74,75,76,77,78,79],[80,81,82,83,84,85,86,87],[88,89,90],[91,92,93,94,95],[96,97,98,99,100,101,102],[103],[104],[105,106,107,108],[109,110,111],[112,113],[114]]
Tip: 46
Prev prev db: SparseCheckpointsDB [0,18,24,29,30]
Prev db: SparseCheckpointsDB [0,30,37]
New db: [0,46]
...which is db 9 out of 25
First unstable: 45
Oldest expected: 33
Cps handling max rollback: []
Future cps: [0,36,39,42,45,46]
)
…nown acceptable limit

What we show is that, assuming the longest possible chain switch is of length k, the wallet will never need to re-apply more than k + gapSize blocks, where gapSize is strictly smaller than k. So this gives the guarantee that we'll never have to re-apply more than 2k blocks in worst-case scenarios, which seems very acceptable.
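To put concrete numbers on that bound, using the figures mentioned elsewhere in this thread (k = 2160, gapSize = k / 3 = 720); the values are illustrative:

```haskell
-- Worst-case number of blocks to re-apply after a rollback of k blocks:
-- bounded by k + gapSize, which stays below 2 * k.
worstCaseReapply :: Int
worstCaseReapply = k + gap        -- 2160 + 720 = 2880 < 2 * 2160 = 4320
  where
    k   = 2160
    gap = k `div` 3               -- gapSize, strictly smaller than k
```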
This timeout is really just there to check that the function doesn't blow up and stays within an 'acceptable' bound. It fails to pass on some machines in CI, so I'm increasing it a bit, but keeping it within the same order of magnitude.
bors r+

Build succeeded: