
eth/downloader: fix invalid hash chain error due to head mini reorg#17839

Merged: karalabe merged 1 commit into ethereum:master from karalabe:downloader-invalid-hash-chain-fix on Oct 5, 2018
Conversation

@karalabe (Member) commented Oct 4, 2018

For a very long time now we've received issue reports about synchronization failing with "invalid hash chain" errors. We managed to reproducibly hit it on Rinkeby lately, so here's the fix! 💥

The issue happens during fast sync, while the downloader is busy pulling the latest state trie. Since we added in-memory pruning about a year ago, the downloader cannot stick to a single pivot point throughout sync; instead it needs to push the pivot forward if the chain progresses enough (since old state becomes unavailable). This was handled by having the downloader constantly recheck for new headers while state sync is ongoing, slowly pushing them on top of the current queue.

A small mini reorg on the chain head, however, meant that a newly discovered header might not actually fit on top of the old queue, causing an invalid hash chain error and dropping the master peer. This can happen on mainnet if a previous head becomes an uncle, and it happens relatively often on Rinkeby when out-of-turn signers race for the next block.


I've tried to create a solution that's minimally invasive, since the downloader is one of our most sensitive pieces of code. The PR contains two tiny extensions to the header sync logic:

  • If we're no longer syncing skeletons, only small batches of headers (i.e. we're homing in on the head of the chain), we reach out to our local database to see where we stand sync-progress-wise. If there's plenty of work left (48 blocks currently), then instead of delivering all the new headers to the queue, we retain/delay the last 2. This ensures that as long as we're busy syncing, any tiny 2-block head reorgs get silently ignored.
  • After a while, the only missing headers will be the 2 we ignored, so every header fetch will just return the same 2 headers. We don't want the downloader to spin like crazy fetching these over and over again, so the second extension is that if we received headers, but have none to import (i.e. all were delayed), we sleep for 3 seconds before trying again to check for new data.

This seems to have completely fixed my Rinkeby sync, where I can do a full fast sync without hitting any invalid hash chain errors whatsoever.

Comment thread on eth/downloader/downloader.go (outdated)

Contributor:
s/that/than

Comment thread on eth/downloader/downloader.go (outdated)

Contributor:
That means: if head+48 < headers[last] && len(headers) > 2, then use headers[:-2]. So what if len(headers) == 1?

Either we should pause there, or the initial check n > 0 should be n > reorgProtHeaderDelay.

Member Author (karalabe):

Headers are always fetched in batches of 128. Worst case, if there's exactly 1 header after a skeleton fetch, then we skip delaying and potentially hit the previous issue. I could add an extra clause to handle the case where len(headers) is less than reorgProtHeaderDelay.

Comment thread on eth/downloader/downloader.go (outdated)

Contributor:

I don't know the full details of how this part works, but you introduce a sleep here (3s) which wasn't there previously. Are you sure that (1) it's sufficient time, and (2) it doesn't block other important things, like downloader events that cannot be delivered?

Member Author (karalabe):

This mirrors line 826, which did the same thing, just when we fully exhausted our headers.

@karalabe karalabe merged commit 5d3b7bb into ethereum:master Oct 5, 2018
gzliudan added a commit to gzliudan/XDPoSChain that referenced this pull request Feb 26, 2025