eth/downloader: fix invalid hash chain error due to head mini reorg#17839
eth/downloader: fix invalid hash chain error due to head mini reorg#17839karalabe merged 1 commit intoethereum:masterfrom karalabe:downloader-invalid-hash-chain-fix
Conversation
There was a problem hiding this comment.
that means. If head+48 < headers[last] && len(headers) > 2 , then use headers[:-2]. So what if len(headers)== 1
Either we should pause there, or the initial check n > 0 should be n > reorgProtHeaderDelay
There was a problem hiding this comment.
Headers are always fetched in batches of 128. Worst case scenario if there's exactly 1 header after a skeleton fetch, then we skip delaying and potentially hit the previous issue. I could add an extra clause to handle the case if len(headers) is less then reorgProtHeaderDelay.
There was a problem hiding this comment.
I don't know the full details about how this part works -- but you introduce a sleep here (3s) which wasn't there previously. Are you sure that's 1. sufficient time and 2. doesn't block other important things, like downloader events that cannot be delivered?
There was a problem hiding this comment.
This mirrors line 826, which did the same thing, just when we fully exhausted our headers.
For a very long time now we're received issue reports about the synchronization failing with "invalid hash chain" errors. We managed to reproducibly hit it on Rinkeby lately, so here's the fix! 💥
The issue happens during fast sync, while the downloader is busy pulling the latest state trie. Since we added in-memory pruning about a year ago, the downloader cannot stick to a single pivot point throughout sync, rather it needs to push it forward if the chain progresses enough (since old state will become unavailable). This was handled by the downloader constantly rechecking for new headers while state sync is ongoing, slowly pushing them on top of the current queue.
A small mini reorg on the chain head however meant that the newly discovered header might actually not fit on top of the old queue, causing an invalid hash chain error, dropping the master peer. This can happen on mainnet if a previous head becomes an uncle, and this happens relatively often on Rinkeby if out-of-turn signers race for the next block.
I've tried to create a solution that's minimally invasive, since the downloader is one of our most sensitive pieces of code. The PR contains two tiny extensions to the header sync logic:
This seems to have completely fixed my Rinkeby sync, where I can do a full fast sync without hitting any invalid hash chain errors whatsoever.