Skip to content

Client: keep syncing at tip of chain#1132

Merged
holgerd77 merged 23 commits into
masterfrom
continous-syncing
Aug 20, 2021
Merged

Client: keep syncing at tip of chain#1132
holgerd77 merged 23 commits into
masterfrom
continous-syncing

Conversation

@jochem-brouwer

Copy link
Copy Markdown
Member

This PR intends to keep the fullsync-syncer syncing at the tip of the chain. (Note: current branch target is the YoloV3 branch)

  • Add enqueueTask to fetcher. This allows one to manually enqueue a task, and possibly restart the fetcher
  • Add an option to not destroy the fetcher when it has processed all tasks
  • Optimize full sync: in case no peers are available, then wait 5 seconds and try again, to get a best peer (this possibly fixes the re-bootstrap at some points)
  • Fullsync sync method never resolves.
  • Enqueue new blocks when the best peer announces new blocks

This is very WIP. Here are some things which need to be resolved:

  • Add logic when there is a gap in requested blocks (possibly, we just change the block fetcher, such that we can just update the highest block number on-demand, and not add all this extra logic)
  • Add tests
  • Fix the error as described below

Currently, it seems to randomly work for some blocks (when we are at tip of chain, i.e. on YoloV3, every 15 seconds it imports and executes a new block), but then at some point it throws:

  devp2p:eth Received NEW_BLOCK_HASHES message from 3.9.20.133:30303: e6e5a0dcc51743c5579b7bcdfadda2209ee2280e16e3e065d74333ba9ee8... +12s
  devp2p:eth Send GET_BLOCK_HEADERS message to 3.9.20.133:30303: c7830150d5018080 +5ms
  devp2p:eth Received BLOCK_HEADERS message from 3.9.20.133:30303: f9025cf90259a0d28799ec1625a5d9853afc3e10f415dd05071191af1bbf... +15ms
  devp2p:eth Send GET_BLOCK_BODIES message to 3.9.20.133:30303: e1a0dcc51743c5579b7bcdfadda2209ee2280e16e3e065d74333ba9ee836... +5ms
  devp2p:eth Received BLOCK_BODIES message from 3.9.20.133:30303: c3c2c0c0 +15ms
ERROR [02-28|18:27:21] Error [ERR_STREAM_PUSH_AFTER_EOF]: stream.push() after EOF
    at readableAddChunk (_stream_readable.js:260:30)
    at BlockFetcher.Readable.push (_stream_readable.js:213:10)
    at BlockFetcher.dequeue (/Volumes/Ethereum/ethereumjs-vm/packages/client/lib/sync/fetcher/fetcher.ts:163:17)
    at BlockFetcher.success (/Volumes/Ethereum/ethereumjs-vm/packages/client/lib/sync/fetcher/fetcher.ts:219:14)
    at /Volumes/Ethereum/ethereumjs-vm/packages/client/lib/sync/fetcher/fetcher.ts:268:44
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
  devp2p:eth Send GET_BLOCK_HEADERS message to 3.9.20.133:30303: c7830150d0048080 +28ms

@codecov

codecov Bot commented Feb 28, 2021

Copy link
Copy Markdown

Codecov Report

Merging #1132 (909b3aa) into master (3d9774d) will increase coverage by 0.77%.
The diff coverage is 77.60%.

Impacted file tree graph

Flag Coverage Δ
block 86.76% <ø> (-0.17%) ⬇️
blockchain 83.56% <ø> (ø)
client 83.41% <77.60%> (-0.87%) ⬇️
common 94.39% <ø> (-0.11%) ⬇️
devp2p 82.86% <ø> (-0.11%) ⬇️
ethash 86.56% <ø> (ø)
tx 88.24% <ø> (-0.12%) ⬇️
vm 82.78% <ø> (+2.60%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@jochem-brouwer jochem-brouwer mentioned this pull request Mar 1, 2021
4 tasks
Base automatically changed from yolov3-client to master March 5, 2021 11:04
@holgerd77 holgerd77 force-pushed the continous-syncing branch from 8aa8341 to 1626309 Compare March 5, 2021 11:08
@holgerd77

Copy link
Copy Markdown
Member

Rebased this one respectively cherry-picked the new commit.

@jochem-brouwer

Copy link
Copy Markdown
Member Author

Will take a look at the tests ASAP, they probably fail because this Promise in full sync now never resolves.

Comment thread packages/client/lib/sync/fullsync.ts Outdated
@jochem-brouwer

Copy link
Copy Markdown
Member Author

Yolo's bootnode is down =(

@holgerd77

Copy link
Copy Markdown
Member

Yes, I guess Yolo is shut down completely 😕, a bit sad, right? Maybe we want to get to a situation quickly where we can spin up our own networks between two plus clients, then we can do some proper integrated testing.

@jochem-brouwer

Copy link
Copy Markdown
Member Author

Yeah, I think we need some tooling to import custom genesis blocks so we can setup a development environment to sync with private nodes.

@holgerd77

Copy link
Copy Markdown
Member

Rebased this.

Comment thread packages/client/lib/sync/fetcher/fetcher.ts Outdated
@holgerd77

holgerd77 commented Aug 16, 2021

Copy link
Copy Markdown
Member

Short writeup on how to get to a quick test scenario with two local clients:

# 1. Start a dev1 client to fill in a first few hundred blocks, cut off quickly (goerli works more streamlined than mainnet)
npm run client:start -- --network=goerli --datadir=datadir-dev1

# 2. Restart dev1 client in a useful debug scenario and with discovery switched off, keep running
DEBUG=devp2p:ETH npm run client:start:dev1 -- --datadir=datadir-dev1

# 3. Start dev2 client from another window to connect to dev1 client, should sync these first couple of hundred blocks
rm -Rf ./datadir-dev2 && DEBUG=devp2p:ETH npm run client:start:dev1 -- --datadir=datadir-dev1

Should lead to something like this:

grafik

@acolytec3

Copy link
Copy Markdown
Contributor

Short writeup on how to get to a quick test scenario with two local clients:

# 1. Start a dev1 client to fill in a first few hundred blocks, cut off quickly (goerli works more streamlined than mainnet)
npm run client:start -- --network=goerli --datadir=datadir-dev1

# 2. Restart dev1 client in a useful debug scenario and with discovery switched off, keep running
DEBUG=devp2p:ETH npm run client:start:dev1 -- --datadir=datadir-dev1

# 3. Start dev2 client from another window to connect to dev1 client, should sync these first couple of hundred blocks
rm -Rf ./datadir-dev2 && DEBUG=devp2p:ETH npm run client:start:dev1 -- --datadir=datadir-dev1

Should lead to something like this:

grafik

Nice! I've been trying to conceive of how to approach this.

@holgerd77

Copy link
Copy Markdown
Member

Ok, this should already bring us some good way in the right direction.

I've now reworked larger parts of this PR and then integrated some more fine-grained and practical setting of the SYNCHRONIZED status. This has now been moved to config so that it is more widely available. A status update is now not working for a Promise from the Peer sync to resolve. Instead on every chain update it is checked if we are reaching a targeted head of the chain which is taken and updated from the NEW_BLOCK_HASHES methods. If the chain receives no further updates for some certain amount of time (set this to 60s for a start) the SYNCHRONIZED status is reset to false.

I am not claiming that I've always found the optimal structures here everywhere, so @jochem-brouwer and others feel free to evolve on this if you have some strong feeling for further reworking things.

Hope this is some good start though. 😄

//cc @gabrocheleau @acolytec3

Comment thread packages/client/lib/sync/fullsync.ts Outdated
this.pool.size
}`
// eslint-disable-next-line no-async-promise-executor
return await new Promise(async () => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm there is no resolve or reject on this promise

@ryanio ryanio Aug 17, 2021

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, maybe this is a place where Event.SYNC_SYNCHRONIZED would be useful, we could resolve when received if the typedoc above is still accurate: * @return Resolves when sync completed

(edit: tentatively added this, although we can reevaluate if it's appropriate here, if not please feel free to remove, but we should consider how to have at least one path to resolve(true))

Comment thread packages/client/lib/sync/lightsync.ts Outdated
@ryanio

ryanio commented Aug 17, 2021

Copy link
Copy Markdown
Contributor

Ok I was able to accomplish quite a few things, please browse my added commits for details on each one.

The integration tests still aren't fully working for me yet, my last commit helped but now [Integration:FullSync] is hanging after finishing, have to stop for today but I can continue tomorrow.

})
if (min.eqn(-1)) {
return
handleNewBlockHashes(data: [Buffer, BN][]) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks, that's a lot better! 😄

return await new Promise(async () => {
if (!peer) return false
return await new Promise(async (resolve, reject) => {
if (!peer) return resolve(false)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still no Promise expert, what would happen if we skip this whole Promise wrapping? 🤔

Generally this whole Promise resolving has gotten less important with the extraction of the sync status determination by listening and acting to chain updates (I already thougth about just returning Promise<void> on this function).

We just need to make sure that in the Sync.start() call no concurrent sync() calls are triggered.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: tested this with removing the Promise wrapper, this then stops at another test in sync.spec.ts unit test, couldn't investigate further but still not completely getting why (if) this is needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's part of some promise chain that awaits for results, but I'm sure we could find an alternative architecture, I think it is probably a bit redundant now since it just awaits the SYNC'd event we can probably just listen for that wherever this promise is functionally being used.

@holgerd77

Copy link
Copy Markdown
Member

Rebased this.

@holgerd77

Copy link
Copy Markdown
Member

Ok, tests are processing further with the latest changes in 28bf3b4, still not finishing the # should sync with best peer tests in the [Integration:LightSync] tests (npm run tape -- test/integration/lightsync.spec.ts).

Comment thread packages/client/lib/sync/sync.ts Outdated
} else if (best?.les) {
targetHeight = new BN(best.les.status.headNum)
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we want to have this for directly setting the sync target height on a first iteration.

Have given this a bit more thought. The timestamp from the block is actually also not a good indicator for having the chain synced. For example for a stalled chain with 3 participants and 2 signers where we are the one signer and the other signer dropped some time ago (and we therefore would be on charge to produce a next block) there just won't be any new NEW_BLOCK_HASHES triggers that would update the chain and bring us in a synced state so that we could finally decide to start on block producing.

I guess with this combination (setting to the peer height and then soon after to the block height from NEW_BLOCK_HASHES) we should already cover a good flavour of scenarios (in the "normal" case there should also somewhat soon be a NEW_BLOCK_HASHES message coming in).

One scenario which is not covered yet though: how to decide in a two peer setup where we are the block producer and the other peer is just waiting for us that we "are synced" and therefore should actually start/continue producing blocks? 🤔

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added this in ccc0e09.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I guess this b24dc4a is my first take on this, to wait for some time if we find a best peer with a higher target height (and in the meantime there is also the chance that some NEW_BLOCK_HASHES come in) and otherwise set the sync target height to the local DB height and emit a SYNCHRONIZED event.

This should now also cover this 2 peer scenario and give us the basics to start on block producing.

(really tricky edge cases here though, I wonder e.g. if we should add an explicit flag to allow block production from genesis onwards to not accidentally start new chains if there is no network connectivity)

@holgerd77

Copy link
Copy Markdown
Member

Phew, I had the most intense debug session on this I guess I ever had this year, roughly three hours straight and testing close to everything to see why these tests are hanging in test/integration/fullsync.spec.ts.

I did try to get hold of the hanging processes by using ps aux and then going down with lsof -p {process.pid} which I got from this leaked-handles package, but this was not very informative.

I manually injected some console.log(obj) into the node_modules/leaked-handles sources to get the full stream objects hanging and not only this summarized version. This at least revealed that these were actually two WriteStream and one ReadStream handles hanging, still not sure if the TCP hint from leaked-handles was therefore generally misguiding or if a TCP stream is also a WriteStream.

I digged super deep into the mocks/mockserver.ts and mocks/network.ts code and entered debug output in the respective destroy() methods to see if all streams got destroyed. Seems that everything is good there.

Finally I went down all the commits. The complete hanging was actually already there after the first commit from Jochem starting the work here.


So as some side note and (re-)learning: we really really should do client work and commits on a very granular basis and extremely regularly run the tests. This is getting so hard otherwise to track down things, especially since there often additionally come things like timing/handlers/unresolved Promises issues into play.


I am now at least at a point where the tests pass every second time or so. Lol.

Hope that this can be resolved with some timer or so. Very much open for everyone to be investigated and continued.

I would have a strong tendency to then merge this PR once we have got the tests reliably running. This is otherwise getting too extensive and I think we have got a good basis here. This can then already be used to continue work e.g. on the tx pool or the block builder and things from here can also still later be revised or evolved. These kind of changes from here will also get easier to debug once we have the local chain sync functionality (so tx pool + block builder) in place. That will allow us to very easily simulate various edge cases locally.

@holgerd77 holgerd77 marked this pull request as ready for review August 20, 2021 13:54

@holgerd77 holgerd77 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I would now merge this in so that we can build on top of this in other PRs, also see my latest comment on that.

This has also been cross-reviewed along the work process by having had three people working on the PR.

@holgerd77 holgerd77 merged commit cbcb7ac into master Aug 20, 2021
@holgerd77 holgerd77 deleted the continous-syncing branch August 20, 2021 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants