
FCU Canonical chain corruption after beacon sync (3 fixes + startup healing)#10876

Merged
svlachakis merged 44 commits into master from canonical-fix on Mar 27, 2026

Conversation

Contributor

@svlachakis svlachakis commented Mar 19, 2026

Closes #9906 and #10861

Changes

Fix 1: Stale canonical markers after beacon sync

BlockDownloader calls UpdateMainChain(wereProcessed: false) to mark synced blocks canonical without updating Head. This creates three cases where stale markers survive:

  1. FCU reorgs to a sibling at the same height (previousHeadNumber == lastNumber) — the unmark loop condition previousHeadNumber > lastNumber is false, so it is skipped entirely.
  2. ePBS FCU to an ancestor — the downward loop clears down to lastNumber but cannot see beacon-synced markers above the stale Head.
  3. A concurrent MoveToMain clears an intermediate level mid-scan, creating a gap — the old break on first non-canonical level stopped before clearing markers above it.

In all three cases orphaned blocks retain HasBlockOnMainChain = true, causing eth_getBlockByNumber to return the wrong block.

Fix: Added an unconditional upward scan after the downward unmark, starting from Math.Max(previousHeadNumber, lastNumber) + 1. The scan skips levels without a canonical marker (gaps) instead of stopping at them, and terminates only when a level does not exist, covering all three cases in one loop.
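As a minimal Python sketch of that scan, assuming a simplified in-memory chain-level model (the dict shape and function name are illustrative, not Nethermind's actual C# API):

```python
# Illustrative model only: each level is {'canonical': bool}; in Nethermind
# the real structures are ChainLevelInfo entries with HasBlockOnMainChain.
def clear_stale_markers_above(levels, previous_head_number, last_number):
    """Clear stale canonical markers left above the new head by beacon sync."""
    number = max(previous_head_number, last_number) + 1
    while True:
        level = levels.get(number)
        if level is None:
            break  # level does not exist: end of the known chain
        if level['canonical']:
            level['canonical'] = False  # stale beacon-sync marker cleared
        # a non-canonical level is a gap (e.g. cleared by a concurrent
        # MoveToMain); skip it instead of terminating the scan
        number += 1
```

Starting at max(previousHeadNumber, lastNumber) + 1 is what lets one loop cover the same-height sibling reorg, the ePBS ancestor FCU, and the gap case.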


Fix 2: PoW best-difficulty fallback returns orphaned blocks in PoS

GetBlockHashOnMainOrBestDifficultyHash fell back to highest-TD when HasBlockOnMainChain=false. In PoS all blocks share TD=0, so the fallback returned whichever block it found first — after a reorg, the orphaned one.

Fix: Post-merge (Head.TotalDifficulty ≥ TTD), return null immediately with no fallback — matching geth's ReadCanonicalHash semantics.
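The guard could look like the following Python sketch, a simplified stand-in for the C# method (the tuple layout and function name are assumptions of this sketch):

```python
# Illustrative stand-in for GetBlockHashOnMainOrBestDifficultyHash.
# level: list of (block_hash, is_main_chain, total_difficulty) tuples.
def get_canonical_hash(level, head_total_difficulty, terminal_total_difficulty):
    for block_hash, is_main, _ in level:
        if is_main:
            return block_hash  # canonical marker wins, pre- or post-merge
    if head_total_difficulty >= terminal_total_difficulty:
        return None  # post-merge: no fallback, matching geth's ReadCanonicalHash
    if not level:
        return None
    # pre-merge PoW fallback: highest total difficulty
    return max(level, key=lambda entry: entry[2])[0]
```

In PoS all blocks share TD = 0, so without the guard the fallback picks an arbitrary block; returning None instead forces callers to treat orphaned heights as non-canonical.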


Fix 3: Startup repair for nodes already affected

Nodes that hit this bug before the fix may have corrupted markers on disk. A new config flag triggers a one-time repair at boot:

--Init.HealCanonicalChain=true
--Init.HealCanonicalChainDepth=8192

BlockTree.HealCanonicalChain runs two phases in a single atomic BatchWrite:

  1. Upward scan — clears any HasBlockOnMainChain=true levels above head left by the sync path.
  2. Backward walk — follows ParentHash from head for up to HealCanonicalChainDepth blocks, ensuring every ancestor has HasBlockOnMainChain=true at index 0 in its ChainLevelInfo slot.
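As a rough Python model of the two phases, with in-memory dicts standing in for the DB (the real implementation performs all writes inside one atomic BatchWrite; names here are illustrative):

```python
# Illustrative model: blocks maps hash -> (number, parent_hash);
# levels maps number -> canonical hash (None = no canonical block).
def heal_canonical_chain(blocks, levels, head_hash, max_depth):
    head_number, _ = blocks[head_hash]
    # Phase 1: clear any canonical markers above head left by the sync path
    number = head_number + 1
    while number in levels:
        levels[number] = None
        number += 1
    # Phase 2: walk ParentHash backwards, re-marking up to max_depth
    # ancestors (head included) as canonical
    current = head_hash
    for _ in range(max_depth):
        if current not in blocks:
            break  # parent missing (e.g. pruned chain): stop gracefully
        num, parent = blocks[current]
        levels[num] = current
        current = parent
```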

Tests

  • UpdateMainChain_WhenBeaconSyncMarksThenReorgsToSibling_DecanonalizesDescendant — regression for Fix 1. Fails before, passes after.
  • UpdateMainChain_WhenFcuToAncestorWithStaleBeaconSyncedDescendants_ClearsAll — ePBS scenario. Fails before the Math.Max change, passes after.
  • UpdateMainChain_WhenGapInBeaconSyncMarkersAndReorging_ClearsStaleMarkersAcrossGap — gap race scenario. Fails with break-on-gap, passes with skip-gaps.
  • FindBlock_WhenBlockOrphanedAfterReorgInPoS_ReturnsNull — regression for Fix 2. Fails before the post-merge guard, passes after.

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

Testing Notes

Tested the healing functionality on our Gnosis archive node; it works as expected.
Added a regression test to our test suite: NethermindEth/nethermind-node-tests#36

@smartprogrammer93
Contributor

@claude review

@claude

claude bot commented Mar 19, 2026

Claude finished @smartprogrammer93's task in 3m 53s.


Claude's Review


Summary

The fix is correct and minimal. Removing the 5-line early return restores the semantics demanded by the Engine API spec: an FCU to any already-known block must always update chain state, not just return Valid silently.


The Fix (ForkchoiceUpdatedHandler.cs)

The removed block (in the WasProcessed = true branch, after EnsureNewHead) was:

// REMOVED:
if (!IsOnMainChainBehindHead(newHeadBlock, forkchoiceState, out ResultWrapper<...>? result))
    return result;

This fired when block.Number < Head.Number && IsMainChain(block), skipping the UpdateMainChain call that de-canonicalises descendants. The existing guard below it already handles the true no-op case:

bool newHeadTheSameAsCurrentHead = _blockTree.Head!.Hash == newHeadBlock.Hash;
bool shouldUpdateHead = !newHeadTheSameAsCurrentHead && blocks is not null;
if (shouldUpdateHead)
    _blockTree.UpdateMainChain(blocks!, true, true);

This correctly avoids redundant work when FCU targets the current head while still processing ancestor FCUs.

Interesting finding: TaikoForkchoiceUpdatedHandler had already worked around this exact bug by overriding IsOnMainChainBehindHead to be a no-op (return true always). The PR correctly fixes the root cause in the base class instead.


Issues

1. The same early-return pattern survives in the !WasProcessed branch (line 170)

if (!blockInfo.WasProcessed)
{
    if (!IsOnMainChainBehindHead(newHeadBlock, forkchoiceState, out ... errorResult))
        return errorResult;   // ← still returns Valid here
    ...
    return ForkchoiceUpdatedV1Result.Syncing;
}

If a block exists in the chain with WasProcessed = false but is already marked canonical (possible in certain beacon-sync edge cases), this path also returns Valid without calling UpdateMainChain. This is a narrower scenario and likely only affects sync paths, but the same class of bug exists. Consider whether this branch also needs addressing.

2. IsOnMainChainBehindHead virtual method is now misleadingly positioned

The virtual method (lines 94–106) only affects the !WasProcessed path now. Its log message "already in canonical chain" combined with returning Valid is inconsistent with what that path does (all other branches in !WasProcessed return Syncing). Worth a follow-up cleanup or comment.


Test Coverage

Thorough. The regression test (fcu_to_ancestor_below_head_must_update_head_and_decanonicalise_descendants) captures the exact failure mode. The 18 BlockTreeTests unit tests cover:

  • Double reorgs and lower-height reorgs
  • SwapToMain with 3+ siblings
  • DB persistence after reload
  • IsMainChain / FindHeader / FindBlock consistency
  • wereProcessed = false marking
  • Best-difficulty fallback

Test structure and naming follow repo conventions correctly.


Verdict

Approve. The core fix is correct, minimal, and well-tested. The residual IsOnMainChainBehindHead usage in the !WasProcessed path (issue #1) is a pre-existing concern outside this PR's scope.


Copilot AI left a comment


Pull request overview

Fixes a forkchoice update edge case where FCU targeting an already-canonical ancestor below the current head was previously short-circuited, leaving the higher descendants incorrectly marked canonical and the head pointer unchanged. This aligns ForkchoiceUpdatedHandler.ApplyForkchoiceUpdate behavior with the intended canonical reorg semantics and adds regression/unit coverage around canonical marking.

Changes:

  • Remove the early-return optimization in ForkchoiceUpdatedHandler so UpdateMainChain runs even when FCU targets a canonical ancestor below head.
  • Add an Engine API regression test to ensure FCU to an ancestor moves head back and de-canonicalizes descendants.
  • Add a suite of BlockTree unit tests covering canonical marking/unmarking across reorg and lookup scenarios (including persistence/reload).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Files reviewed:

  • src/Nethermind/Nethermind.Merge.Plugin/Handlers/ForkchoiceUpdatedHandler.cs: Removes the processed-branch early return so FCU to a canonical ancestor triggers main-chain update/de-canonicalization.
  • src/Nethermind/Nethermind.Merge.Plugin.Test/EngineModuleTests.V1.cs: Adds an Engine API regression test reproducing and preventing the “ancestor FCU ignored” canonical mismatch.
  • src/Nethermind/Nethermind.Blockchain.Test/BlockTreeTests.cs: Adds canonical-marking unit tests covering reorg, lookup consistency, and persistence/reload edge cases.


@smartprogrammer93
Contributor

@claude review

@claude

claude bot commented Mar 20, 2026

Claude finished @smartprogrammer93's task in 5m 56s.


Claude's Review


Context

This is a different implementation from what the previous Claude review described. The current branch does not touch ForkchoiceUpdatedHandler.cs. Instead it fixes the root cause in BlockTree.UpdateMainChain and adds a startup healing path.


Summary

The fix is correct and well-targeted. The else branch in UpdateMainChain handles exactly the scenario where previousHeadNumber == lastNumber (stale head after wereProcessed=false sync), and it also correctly no-ops in the normal forward-advancing case. The HealCanonicalChain startup recovery is well-designed with atomic writes. Test coverage is thorough.


Core Fix (BlockTree.cs:1016–1034)

The else branch fires when previousHeadNumber <= lastNumber, covering two cases:

Bug scenario (previousHeadNumber == lastNumber): BlockDownloader calls UpdateMainChain([C], false) marking C canonical at H+1 without advancing Head (which stays at A at H). When FCU(B, same height H) arrives, the old previousHeadNumber > lastNumber check is false (H > H = false), so the unmark loop was skipped entirely. The else branch now scans upward from lastNumber+1, finds C's stale marker, and clears it. ✓

Normal advancing (previousHeadNumber < lastNumber): The else branch also fires here but quickly breaks — LoadLevel(lastNumber + 1) returns null or a non-canonical level on every normal block. The behavior is a no-op except for one extra DB read per block. ✓

The fix is also correct in the re-sync-after-crash case, where stale markers from a previous run are progressively cleared as the sync path re-marks blocks one by one.


Issues

1. Extra LoadLevel read on the block-processing hot path (minor)

Every UpdateMainChain call where previousHeadNumber <= lastNumber — which is every normal block-processing call — now performs one additional LoadLevel(lastNumber + 1) read. In the common case this immediately breaks (nothing found), so the overhead is a single key-value lookup per block. Given that LoadLevel results are likely cached, this is low cost in practice, but it is a measurable regression on the forward-processing hot path.

2. HealCanonicalChainOnStartup not auto-disabled after first run (usability)

The PR description says "use this flag once after observing a canonical-mismatch". If a user enables it and forgets to turn it off, HealCanonicalChain runs on every restart (idempotently, but wastefully scanning up to 8192 blocks). The existing info log ("Healing canonical chain from head...") communicates it's running, but it'd be safer to add a warning like "Consider setting --Init.HealCanonicalChainOnStartup false after recovery" after the heal completes.

3. Config items missing HiddenFromDocs (minor, coding style)

HealCanonicalChainOnStartup and HealCanonicalChainDepth are recovery-oriented one-time flags. Most similar diagnostic/technical flags in IInitConfig (e.g., DisableGcOnNewPayload, InRunnerTest, ExitOnBlockNumber) carry HiddenFromDocs = true to avoid surfacing them in standard docs. These should too.

4. Typo in test name (trivial)

UpdateMainChain_sync_marks_descendant_canonical_then_reorg_to_sibling_decanonalizes_it — "decanonalizes" should be "decanonicalizes".


Observations

  • BatchWrite atomicity: The HealCanonicalChain implementation correctly wraps both phases (upward clear + backward repair) in a single BatchWrite. A crash before Dispose() leaves the DB unchanged — well reasoned.

  • else branch in the sync path: When BlockDownloader calls UpdateMainChain([N+1], false) one block at a time, the else branch fires (since previousHeadNumber < lastNumber) and scans from N+2. If there's a pre-existing stale marker at N+2 from a crashed run, it will be cleared — then re-marked on the next sync call. Functionally correct, adds extra write traffic in re-sync scenarios but this is acceptable.

  • RepairMarkersBelow gracefully handles pruned chains: The parent is null guard with a Logger.Warn is correct. Good defensive coding.

  • ReadOnlyBlockTree throws, BlockTreeOverlay delegates: Both are correct. PerfTest/Program.cs stub no-op is consistent with how other unrelated methods are stubbed there.

  • Copilot's "Ahmad's change" comment: Already addressed — the test comment now says "RequireCanonical-by-number behavior change" which is neutral and appropriate.

  • 26 new tests: Well-structured. Scenarios are distinct enough that they don't mechanically differ by parameters alone, so not parameterizing them is reasonable under the repo test guidelines.


Verdict

Approve with minor suggestions. The core fix is correct, minimal, and properly tested. Items 1–4 are non-blocking — item 1 is a micro-optimization opportunity, items 2–4 are polish.

@LukaszRozmej
Copy link
Copy Markdown
Member

LukaszRozmej commented Mar 20, 2026

BTW, in ePBS we will get an FCU for a block and can then get an FCU to one of its ancestors (not sure if it is always the parent), and we need to support that. How does this change affect that scenario? Is it compatible during sync?

kamilchodola added a commit to NethermindEth/gas-benchmarks that referenced this pull request Mar 23, 2026
Add check_canonical.py that walks the chain backwards via parentHash and
compares against eth_getBlockByNumber results to detect stale canonical
markers after reorgs -- the exact bug described in NethermindEth/nethermind#10876.

Hook the check into run.sh after every setup/*.txt replay (which contains
the reorg separator block). Controlled by CANONICAL_CHECK=true env var,
enabled by default in the CI workflow.

For this temporary testing branch:
- Remove PostgreSQL DB push step from repricing-client.yml
- Remove artifact zip/upload steps from repricing-client.yml
- Remove retrigger job from repricing-nethermind.yml
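The canonical-consistency check that commit describes could be sketched in Python like this; `rpc` is a hypothetical stand-in for a JSON-RPC caller, and the real check_canonical.py in NethermindEth/gas-benchmarks may differ:

```python
# Walk the chain backwards via parentHash and verify eth_getBlockByNumber
# agrees at every height; a disagreement indicates a stale canonical marker.
def check_canonical(rpc, start_number, depth):
    block = rpc("eth_getBlockByNumber", hex(start_number), False)
    for _ in range(depth):
        parent = rpc("eth_getBlockByHash", block["parentHash"], False)
        if parent is None:
            break  # reached genesis or an unknown parent
        by_number = rpc("eth_getBlockByNumber", parent["number"], False)
        if by_number is None or by_number["hash"] != parent["hash"]:
            return False  # by-number lookup disagrees with the parent walk
        block = parent
    return True
```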
LukaszRozmej and others added 3 commits March 27, 2026 14:44
* small refactors in BlockTree

* de-duplicate tests
This reverts commit b7a54d1.
@github-actions github-actions bot removed the network label Mar 27, 2026
@svlachakis svlachakis requested a review from LukaszRozmej March 27, 2026 13:13
@svlachakis svlachakis merged commit facce1b into master Mar 27, 2026
421 of 423 checks passed
@svlachakis svlachakis deleted the canonical-fix branch March 27, 2026 16:53
stdevMac pushed a commit that referenced this pull request Mar 27, 2026
…ealing) (#10876)

* FCU to canonical ancestor silently ignored, leaving descendants canonical

* fix comments

* fix comment

* change approach

* remove empty line

* HealCanonicalChain implementation & tests

* fix build

* ePBS FCU fix

* geth parity with https://github.com/ethereum/go-ethereum/blob/745b0a8c09ad9d0866da67403ffa99d11ba70ec3/core/rawdb/accessors_chain.go#L47

return null for orphaned heights post-merge in GetBlockHashOnMainOrBestDifficultyHash

* GetBlockHashOnMainOrBestDifficultyHash now returns null when
HasBlockOnMainChain=true but WasProcessed=false and blockNumber > Head.Number.

Beacon sync calls UpdateMainChain(wereProcessed=false), setting
HasBlockOnMainChain without advancing Head. If this races with a
cleanup FCU where previousHeadNumber == lastNumber, the upward scan
runs before the marker is set and cannot clear it — leaving a stale
marker that eth_getBlockByNumber would return as canonical.

The write-time scan cannot close this window; a read-time guard can.
WasProcessed=false precisely identifies beacon-sync markers: processed
canonical blocks always have WasProcessed=true, so startup/reload
paths are unaffected.

* revert change for WasProcessed=false

* test: add beacon sync + reorg stale marker reproduction test

Reproduction of the stale canonical markers bug from the Engine API
test generator: beacon sync marks H+1, H+2, H+3 canonical without
advancing Head, then FCU reorgs to a sibling at the same height as
Head. Verifies all orphaned levels are de-canonicalized.

* test: add failing gap test for beacon sync race condition

Adds UpdateMainChain_beacon_sync_gap_in_stale_markers_leaves_orphan_after_reorg
which reproduces the scenario where a concurrent MoveToMain creates a gap in
stale canonical markers during beacon sync. The break-on-first-gap upward scan
stops at the gap and leaves d3 orphaned as stale canonical.

This test FAILS on canonical-fix (expected) and PASSES on bounded-scan.

* fix: skip gaps in upward scan instead of breaking

Change Phase 2 upward scan and ClearStaleMarkersAbove to continue past
levels where HasBlockOnMainChain is false, breaking only when the level
does not exist. This handles gaps left by concurrent MoveToMain without
needing a BestKnownNumber bound.

* PR cleanup

* remove misleading comment

* minor copilot comments

* review comments

* Canonical fix refactor (#10972)

* small refactors in BlockTree

* de-duplicate tests

* Lukasz review

* fixes

* Revert "fixes"

This reverts commit b7a54d1.

* Lukasz review - Refactoring

* fixes

* revert blocktree registration

---------

Co-authored-by: Kamil Chodoła <43241881+kamilchodola@users.noreply.github.com>
Co-authored-by: Kamil Chodoła <kamil.chodola@gmail.com>
Co-authored-by: Lukasz Rozmej <lukasz.rozmej@gmail.com>
asdacap added a commit that referenced this pull request Mar 31, 2026
fix: guard ClearStaleMarkersAbove behind forceUpdateHeadBlock to avoid hot-path scan

PR #10876 added an unconditional upward scan in UpdateMainChain that
clears stale canonical markers left by beacon sync. The scan runs on
every call and walks until LoadLevel returns null, which is O(K) per
block during forward sync (BlockDownloader) and O(K²) total when
BlockchainProcessor falls behind — also incorrectly clearing valid
beacon-synced markers ahead of the processing front.

The scan is only needed during FCU reorgs, which is the only path that
passes forceUpdateHeadBlock: true. Guard the call so forward sync and
forward block processing skip the scan entirely.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
svlachakis added a commit that referenced this pull request Apr 3, 2026
…ealing) (#10876)

# Conflicts:
#	src/Nethermind/Nethermind.Init/Modules/BlockTreeModule.cs
#	src/Nethermind/Nethermind.TxPool.Test/TestBlockTree.cs
svlachakis pushed a commit that referenced this pull request Apr 3, 2026
fix: guard ClearStaleMarkersAbove behind forceUpdateHeadBlock to avoid hot-path scan


Development

Successfully merging this pull request may close these issues.

Gnosis chain - invalid block in canonical index

9 participants