Skip to content

Client: Calaveras Sync Improvements + Working Branch#1310

Merged
holgerd77 merged 8 commits into
masterfrom
calaveras-sync
Jun 22, 2021
Merged

Client: Calaveras Sync Improvements + Working Branch#1310
holgerd77 merged 8 commits into
masterfrom
calaveras-sync

Conversation

@holgerd77

Copy link
Copy Markdown
Member

Related: #1306

Working branch to sync with the Calaveras network, also coming with some calaveras related improvements (feel free to directly push on this branch or in doubt make a PR towards the branch if you have got things to add).

First thing I added here is an official maxPerRequest CLI option, then we don't always have to monkey-patch if running with a 50 maxPerRequest setting doesn't work but can just run npm run client:start -- --network=calaveras --maxPerRequest=1.

The option should be generally useful (so: not just a monkey-patch-ease option 🙂) since it will be likely useful for peopel to play a bit with this, this will likely allow to get to somewhat higher download rates if adjusted due to the specific network and ones own internet connection speed and things like that.

@codecov

codecov Bot commented Jun 21, 2021

Copy link
Copy Markdown

Codecov Report

Merging #1310 (2192982) into master (855e154) will decrease coverage by 0.17%.
The diff coverage is 44.44%.

Impacted file tree graph

Flag Coverage Δ
block 86.20% <ø> (-0.25%) ⬇️
blockchain 83.46% <ø> (+0.06%) ⬆️
client 84.00% <44.44%> (-0.68%) ⬇️
common 88.12% <ø> (ø)
devp2p 84.00% <ø> (ø)
ethash 82.83% <ø> (ø)
tx 89.49% <ø> (-0.12%) ⬇️
vm 76.83% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

@holgerd77

Copy link
Copy Markdown
Member Author

This should also be helpful and give some more orientation and feeling for the network behavior: outputting the base fee during sync along london blocks or later:

grafik

@holgerd77

Copy link
Copy Markdown
Member Author

Yeah! 🎉

This db72a6d actually fixes the maxPerRequest issue and sync is now working with the normal 50 block requests set 😄, we can keep the maxPerRequest option nevertheless I guess.

The block fetcher didn't allow for result sets lower than the requested number before and on Calaveras sync a cut happened on the result set in the first spam-blocks-range (2000-2200 or so) due to the large block sizes (as happened on Baikal before). The fix now allows for results with a lower lengths to be processed (a zero length is still not allowed though to prevent circular requests) and then re-calcuates the first block numbers of the outstanding jobs in the request queue. This should make the client sync generally more resilient for these kind of situations.

@holgerd77

Copy link
Copy Markdown
Member Author

Ah, also wanted to post this screenshot along:

grafik

Sync is incredibly slow along these blocks. But at the end, these are also an awful lot of transactions. Might nevertheless be a good test bed to have a look for some bootlenecks (I guess these are all simple value transfer txs to the same account. I did some tests here with this permanent cache (surviving checkpointing) for MPT idea (//cc @gabrocheleau) but this had not really an effect here (which makes sense since there is just not much state touched with this).

@holgerd77

Copy link
Copy Markdown
Member Author

Don't want to side track too much, but I did some rough performance analysis by running the client on the spammed calaveras blocks with the VM debugger DEBUG=vm:* npm run client:start -- --network=calaveras and had a closer look where the ms diff outputs are (this is actually a not too shabby method to get some pointers on where bootlenecks are. The places for such diffs are very much repeatable, one has to look closely if the debug output is actually before or after an associated method call (at some point we should unify and always call after the execution, makes more sense).

Here is a picture of some associated debug output where I then added some console.log() statements outputting time diffs for closer analysis.

grafik

This is the associated code in VM runBlock.ts:

// Add to total block gas usage
gasUsed = gasUsed.add(txRes.gasUsed)
if (this.DEBUG) {
  debug(`Add tx gas used (${txRes.gasUsed}) to total block gas usage (-> ${gasUsed})`)
}
console.log(`B1${(new Date()).getSeconds()}:${(new Date()).getMilliseconds()}`)
// Combine blooms via bitwise OR
bloom.or(txRes.bloom)
console.log(`B2${(new Date()).getSeconds()}:${(new Date()).getMilliseconds()}`)

// Add receipt to trie to later calculate receipt root
receipts.push(txRes.receipt)
console.log(`B3${(new Date()).getSeconds()}:${(new Date()).getMilliseconds()}`)
const encodedReceipt = encodeReceipt(tx, txRes.receipt)
console.log(`B4${(new Date()).getSeconds()}:${(new Date()).getMilliseconds()}`)
const encoded = encode(txIdx)
console.log(`B4.2${(new Date()).getSeconds()}:${(new Date()).getMilliseconds()}`)
await receiptTrie.put(encoded, encodedReceipt)
console.log(`B5${(new Date()).getSeconds()}:${(new Date()).getMilliseconds()}`)

The large time diff is happening between B4.2 and B5 and can therefore clearly attributed to the update of the receiptTrie.

I find this pretty interesting. On the one hand it is the Trie library, so the usual suspects. 😋 On the other hand I totally not had these smaller trie usages like the receipt trie - might be the same for the tx trie e.g. - on the radar, so we might also want to give this some focus. This is interesting because these tries should be pretty small.

So too things which might be drawn from this short experiment:

  1. We might want to look if especially the execution of small trie setups can be improved (made more performant)
  2. We might dedicately want to look at the trie usage code parts and integrations for these smaller tries

@holgerd77

Copy link
Copy Markdown
Member Author

(side note: I have no idea why this stupid client integration test is failing 🙁)

@holgerd77

Copy link
Copy Markdown
Member Author

But apart from that: Calaveras is actually syncing, just synced up successfully up to block 60.142 (see block exporer for reference on latest block). 😄

grafik

@holgerd77

Copy link
Copy Markdown
Member Author

Still not sure about the test failure though, this is happening between commit 3b8d315 and 2a00695, if someone wants to give it a try run:

git checkout 3b8d315
npx tape -r ts-node/register test/integration/fullsync.spec.ts

Should work.

git checkout 2a00695
npx tape -r ts-node/register test/integration/fullsync.spec.ts

Should fail.

Really strange though, if I didn't miss something the extra if else clause from 2a00695 is not even run once along the test run. 🤨

@ryanio

ryanio commented Jun 21, 2021

Copy link
Copy Markdown
Contributor

The large time diff is happening between B4.2 and B5 and can therefore clearly attributed to the update of the receiptTrie.

I might be missing something but doesn't your screenshot show:

B4.245:486
B545:487

Doesn't that show only :001 ms elapsed between them?

@ryanio

ryanio commented Jun 22, 2021

Copy link
Copy Markdown
Contributor

Ok this failing integration test was definitely an interesting and tricky one. So it sets up two peers, one with 9 blocks and one with 10 blocks, and wants to ensure that it syncs all 10 blocks.

In the code prior to this PR, it would reject the 9 block result since it requested 10. This is actually not really how we would have wanted the test to behave, so I'm happy we have some new and improved code here.

The test fails because it receives the 'synchronized' result with 9 blocks first and fails expecting 10.

I updated the integration test to wait to listen for a possible next event and check that the length is correct.

I also cleaned up some of the no-unnecessary-condition since it is safe to just check the length of the object.

Will rebase on master.

@holgerd77

Copy link
Copy Markdown
Member Author

The large time diff is happening between B4.2 and B5 and can therefore clearly attributed to the update of the receiptTrie.

I might be missing something but doesn't your screenshot show:

B4.245:486
B545:487

Doesn't that show only :001 ms elapsed between them?

Yes, I guess that's the time diff range we are operating in here. That's 1 ms and if you do the math that means for a single block with 1000 txs this makes a difference of 1s (or half a second if we would let's say double the performance). That's not nothing respectively substantial, if we catch 3-4 of such somewhat smaller bottlenecks and could improve here that would already mean a performance increase of some seconds per block. Correct me if I got something wrong here? 🤔

Thanks a lot for digging into this test issue, really grateful here 🙂 🌻, I guess I would likely not have found this myself. I also have the impression that these test code parts produced problems before occasionally, so would be great if we now have a structural improvement here which prevents this on future test runs.

Since you commented on all the changes I've done and Calaveras is actually syncing successfully I guess I will just take this as an informal review and then just merge here. There are not really outstanding tasks and one benefit of this might be - if someone finds the time - that it would be a good occasion to test and improve on the on-tip-of-the-chain behavior from #1132 as long as Calaveras is still up and running. 😄

@holgerd77 holgerd77 left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take approving comments from @ryanio as an informal review and merge here (see previous comment for some reasoning).

@holgerd77 holgerd77 merged commit 75aff49 into master Jun 22, 2021
@holgerd77 holgerd77 deleted the calaveras-sync branch June 22, 2021 07:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants